Hello,
Our jobs have been stuck in the RUNNABLE state in AWS Batch for several days. On checking the AWS console, these are my observations:
- The compute environment is in the INVALID state with the reason: "CLIENT_ERROR - The instance IDs 'i-019be140af144fccf, i-01bd39c87ac0f87b7, i-034e8d61e0b7d5419, i-0ac46775085e4f9c4, i-0e129e96b5112c045' do not exist". Of these 5 instance IDs, 2 actually DO exist in the EC2 console while the other 3 do not (perhaps they were reclaimed, as this is a managed SPOT environment?).
- On the ECS cluster page, the running EC2 Spot instances are registered and visible under "Container Instances", and there are no alerts related to the agent version, etc. (the agent version is 1.65.0).
- In the EC2 console, I can see that several Spot instances have been created, and many of them have been running for 3+ days now.
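In case it helps anyone reproduce the checks above outside the console, here is a minimal sketch, assuming boto3 is installed and credentials with read access to Batch and EC2 are configured. The helper names are my own and purely illustrative; the boto3 calls used are `describe_compute_environments` and `describe_instances`. It pulls the instance IDs out of the compute environment's status reason and separates the ones EC2 still knows about from the ones it does not.

```python
import re

# EC2 instance IDs look like "i-" followed by 8 or 17 hex characters.
INSTANCE_ID_RE = re.compile(r"\bi-[0-9a-f]{8,17}\b")


def parse_instance_ids(status_reason):
    """Extract EC2 instance IDs from a Batch statusReason string."""
    return INSTANCE_ID_RE.findall(status_reason)


def split_by_existence(reported_ids, existing_ids):
    """Split the reported IDs into (found, missing), preserving order."""
    existing = set(existing_ids)
    found = [i for i in reported_ids if i in existing]
    missing = [i for i in reported_ids if i not in existing]
    return found, missing


def fetch_status(compute_env_name):
    """Return (status, statusReason) for one compute environment.

    Assumes an environment with this name exists in the default region.
    """
    import boto3  # imported lazily so the pure helpers work without it

    batch = boto3.client("batch")
    resp = batch.describe_compute_environments(
        computeEnvironments=[compute_env_name]
    )
    env = resp["computeEnvironments"][0]
    return env["status"], env.get("statusReason", "")


def fetch_existing_ids(instance_ids):
    """Return the subset of instance_ids that EC2 can still describe."""
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2")
    existing = []
    # Query one ID at a time: a single unknown ID fails the whole call.
    for instance_id in instance_ids:
        try:
            resp = ec2.describe_instances(InstanceIds=[instance_id])
        except ClientError as err:
            if err.response["Error"]["Code"] == "InvalidInstanceID.NotFound":
                continue
            raise
        for reservation in resp["Reservations"]:
            for instance in reservation["Instances"]:
                existing.append(instance["InstanceId"])
    return existing
```

A typical use would be to call `fetch_status("my-spot-compute-env")` (a hypothetical environment name), pass the returned reason to `parse_instance_ids`, and then feed the result of `fetch_existing_ids` into `split_by_existence`. Note that recently terminated instances can still be returned by `describe_instances` for a short time, so the instance state is worth checking as well.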
My questions:
- Can you please help me understand why this has happened? We have made ZERO changes to the compute environment, and it was working correctly until Feb 1st.
- What can be done to fix this?
- Will I be charged for the instances that were created by Batch and have been sitting IDLE for over 3 days? A total of 16 instances were created by the Batch compute environment. I can share the instance IDs if needed.