AWS Batch requesting more vCPUs than tasks require


Hi,

We have an AWS Batch compute environment set up to use EC2 spot instances, with no limits on instance type, and with the SPOT_CAPACITY_OPTIMIZED allocation strategy.
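For context, the compute environment was created along these lines (a rough boto3 sketch; the name, subnets, security group, and instance role below are placeholders, not our real values — the relevant parts are SPOT with SPOT_CAPACITY_OPTIMIZED and no restriction to specific instance sizes):

```python
# Rough sketch of our compute environment setup (boto3); all names/IDs/roles
# are placeholders. The relevant parts are SPOT + SPOT_CAPACITY_OPTIMIZED with
# no restriction to specific instance sizes.
import boto3

batch = boto3.client("batch")

batch.create_compute_environment(
    computeEnvironmentName="spot-ce",                 # placeholder
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "maxvCpus": 256,                              # illustrative limit
        "instanceTypes": ["optimal"],                 # no specific sizes pinned
        "subnets": ["subnet-xxxxxxxx"],               # placeholder
        "securityGroupIds": ["sg-xxxxxxxx"],          # placeholder
        "instanceRole": "ecsInstanceRole",            # placeholder instance profile
    },
)
```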

We submitted a task requiring 32 vCPUs and 58,000 MB of memory (2 GB below the memory of the smallest 32 vCPU instance size, c3.8xlarge, just to leave a bit of headroom), and this is reflected on the job status page. We expected to receive an instance with 32 vCPUs and >64 GB of memory, but instead received an r4.16xlarge with 64 vCPUs and 488 GB of memory.
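The job itself is submitted roughly like this (the job, queue, and definition names are placeholders); the resource request is the important part:

```python
# Minimal sketch of the job submission (boto3); jobName/jobQueue/jobDefinition
# are placeholders. The resourceRequirements are what the job actually asks for.
import boto3

batch = boto3.client("batch")

batch.submit_job(
    jobName="pinned-task",              # placeholder
    jobQueue="spot-queue",              # placeholder
    jobDefinition="pinned-task-def:1",  # placeholder
    containerOverrides={
        "resourceRequirements": [
            {"type": "VCPU", "value": "32"},
            {"type": "MEMORY", "value": "58000"},  # MiB; ~2 GB of headroom below a c3.8xlarge
        ]
    },
)
```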

An r4.16xlarge is rather oversized for the single task in the queue, and our task can't take advantage of the extra cores, as we pin processes to the specified number of cores so that multiple tasks scheduled on the same host don't contend for CPU. We had no other tasks in the queue and no running compute instances, nor any desired/minimum vCPUs set on the compute environment before this task was submitted.

The Auto Scaling history shows a user request updating the AutoScalingGroup constraints to min: 0, max: 36, desired: 36, changing the desired capacity from 0 to 36.
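(For reference, this is roughly how we pulled that history from the Auto Scaling group that Batch created for the compute environment; the group name below is a placeholder.)

```python
# Rough sketch of how we read the scaling history (boto3); the Auto Scaling
# group name is a placeholder for the one Batch created for our compute environment.
import boto3

autoscaling = boto3.client("autoscaling")

response = autoscaling.describe_scaling_activities(
    AutoScalingGroupName="Batch-spot-ce-asg-placeholder"
)
for activity in response["Activities"]:
    # The Cause field records the min/max/desired change that was requested.
    print(activity["StartTime"], activity["Cause"])
```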

Where did this 36 come from? Surely this should be 32 to match our task?

I'm aware that the docs say: "However, AWS Batch might need to exceed maxvCpus to meet your capacity requirements. In this event, AWS Batch never exceeds maxvCpus by more than a single instance." But we're concerned that once we start scaling up, each task will be erroneously requested with 4 extra vCPUs.

My guess is that what happened in this case is down to the SPOT_CAPACITY_OPTIMIZED allocation strategy:

  • Batch probably queried for the best available host to meet our 32 vCPU requirement and got the answer c4.8xlarge, which has 36 vCPUs.
  • Batch then told the Auto Scaling group to scale to 36 vCPUs, expecting to get a c4.8xlarge from the spot instance request.
  • The spot allocation strategy is currently set to SPOT_CAPACITY_OPTIMIZED, which prefers instances that are less likely to be interrupted (rather than preferring the cheapest or best-fitting ones).
  • The spot request looked at the availability of c4.8xlarge, decided it was too likely to be interrupted under the SPOT_CAPACITY_OPTIMIZED strategy, and substituted the most-available host that met the 36 vCPU requirement set by Batch, which turned out to be an oversized 64 vCPU r5 instead of the better-fitting 32 or 48 vCPU r5.

But the above implies that Batch itself doesn't follow the same logic as SPOT_CAPACITY_OPTIMIZED, and instead requests the specs of the "best fit" host even if that host will not be provided by the spot request, resulting in potentially significantly oversized hosts.

Alternatively, the 64 vCPU r5 happened to have better availability than the 48 or 32 vCPU r5, but I don't see how that would be possible: the 64 vCPU r5 is just twice the 32 vCPU one, and these are virtualised hosts, so you would expect the availability of the 64 vCPU size to be roughly half that of the 32 vCPU size.

Can anyone confirm whether either of my guesses is correct, whether I'm thinking about this the wrong way, or whether we've missed a configuration setting?

Thanks!

1 Answer

Batch scales EC2 resources to optimize for both throughput and cost, which means it may launch larger instances that can handle more than a single job concurrently. It's not a 1:1 relationship between job size and EC2 instance size. While this can be suboptimal for a test run, it works much in your favor when you deploy a production environment that submits more than a single job.

If you want to restrict instance selection for a specific use case, you can do so by setting the compute environment's instanceTypes parameter to the specific instance sizes that match your jobs.
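For example, a rough sketch (boto3; the compute environment name and instance sizes below are only illustrative) of restricting an environment to 32-vCPU sizes so a 32-vCPU job lands on a matching host. Changing instanceTypes in place requires a compute environment that supports infrastructure updates; otherwise you would recreate the environment with the restricted list:

```python
# Illustrative sketch: restrict an existing compute environment to 32-vCPU sizes.
# The compute environment name and instance list are examples, not requirements.
import boto3

batch = boto3.client("batch")

batch.update_compute_environment(
    computeEnvironment="spot-ce",  # placeholder name
    computeResources={
        # Only 32-vCPU instance sizes, so jobs of this shape get a matching host:
        "instanceTypes": ["m5.8xlarge", "r5.8xlarge"],
    },
)
```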

