Hello.
Would it be possible for you to share the sample code where the problem is occurring?
I suspect that the process is running in parallel, as suggested by the AI's automated response, and is therefore hitting API rate limits.
Incidentally, if you only need to check for CloudFormation drift, you can do so using AWS Config's managed rules.
This way, you can perform the check without writing complex Python code.
https://docs.aws.amazon.com/config/latest/developerguide/cloudformation-stack-drift-detection-check.html
The following blog post may also be helpful.
https://andrii-shykhov.medium.com/cloudformation-drift-detection-and-notification-with-aws-config-remediation-action-0c4891320b38
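For reference, registering the managed rule from Python could look roughly like the sketch below. It is only an illustration under assumptions: the rule name, the role ARN, and the helper names `build_drift_rule` and `register_drift_rule` are placeholders, the check frequency is an arbitrary choice, and AWS Config recording is assumed to be enabled already in the account and Region.

```python
import json


def build_drift_rule(role_arn, rule_name="cfn-stack-drift-check"):
    """Build the ConfigRule payload for the managed drift-detection rule."""
    return {
        "ConfigRuleName": rule_name,  # hypothetical name, pick your own
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "CLOUDFORMATION_STACK_DRIFT_DETECTION_CHECK",
        },
        # The managed rule needs a role that CloudFormation can use to detect drift.
        "InputParameters": json.dumps({"cloudformationRoleArn": role_arn}),
        "Scope": {"ComplianceResourceTypes": ["AWS::CloudFormation::Stack"]},
        # Assumed frequency; adjust to how often drift should be re-checked.
        "MaximumExecutionFrequency": "TwentyFour_Hours",
    }


def register_drift_rule(config_client, role_arn):
    """Register the rule; config_client is expected to be boto3.client("config")."""
    config_client.put_config_rule(ConfigRule=build_drift_rule(role_arn))
```

With this in place, `register_drift_rule(boto3.client("config"), "<role ARN>")` would create the rule, and drifted stacks would then show up as noncompliant resources in AWS Config.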
def _detect_drift_and_poll(self, stack_name: str):
    delay = 5
    try:
        response = self.cloudformation_client.detect_stack_drift(StackName=stack_name)
        detection_id = response["StackDriftDetectionId"]
    except ClientError as e:
        log.error(f"Failed to start drift: {e}")
        return {"stack": stack_name, "drift_status": "INITIATION_FAILED"}
    for attempt in range(MAX_ATTEMPTS):
        time.sleep(POLLING_INTEVAL)
        try:
            status_response = self.cloudformation_client.describe_stack_drift_detection_status(
                StackDriftDetectionId=detection_id)
            detection_status = status_response["DetectionStatus"]
            if detection_status == "DETECTION_COMPLETE":
                drift_status = status_response.get("StackDriftStatus", "UNKNOWN")
                drifted_count = status_response.get("DriftedStackResourceCount", 0)
                log.info(f"[{stack_name}] {drift_status} {drifted_count} resources drifted")
                return {"stack": stack_name, "drift_status": drift_status, "drifted_resource_count": drifted_count}
            elif detection_status == "DETECTION_FAILED":
                log.error(f"[{stack_name}] Detection failed")
                return {"stack": stack_name, "drift_status": "DETECTION_FAILED"}
            log.info(f"[{stack_name}] still processing (attempt({attempt + 1}/{MAX_ATTEMPTS} attempts")
    return {"stack":
Try outputting the reason for the failure to the log as shown below.
log.error(f"[{stack_name}] Detection failed: {reason}")
Also, try setting up retries using boto3's retries feature. https://docs.aws.amazon.com/botocore/latest/reference/config.html
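To make both suggestions concrete, here is a minimal sketch. It assumes `reason` is meant to come from the `DetectionStatusReason` field of the `describe_stack_drift_detection_status` response; the helper names and the retry values (`max_attempts` of 10, `adaptive` mode) are illustrative choices, not settings taken from your code.

```python
# Retry settings passed to botocore; the values are illustrative assumptions.
RETRY_SETTINGS = {"max_attempts": 10, "mode": "adaptive"}


def make_cloudformation_client():
    """Create a CloudFormation client that retries throttled calls."""
    # Imported here so the helper below stays usable without boto3 installed.
    import boto3
    from botocore.config import Config

    return boto3.client("cloudformation", config=Config(retries=RETRY_SETTINGS))


def failure_reason(status_response):
    """Return the failure explanation from a detection-status response."""
    return status_response.get("DetectionStatusReason", "no reason returned")
```

In the polling loop, `reason = failure_reason(status_response)` could then be set just before the `log.error` call in the `DETECTION_FAILED` branch.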
As I mentioned in a previous answer, you can also check CloudFormation drift using AWS Config.
Based on the drift detection behavior you describe, several factors could be causing the DETECTION_FAILED status in your containerized environment:
Understanding DETECTION_FAILED Status
The DETECTION_FAILED status indicates that the stack drift detection operation has failed for at least one resource in the stack. When this occurs, results will only be available for resources on which CloudFormation successfully completed drift detection.
Likely Causes in Your ECS Container Environment
- IAM Permissions: Your ECS task role may not have sufficient permissions to read all resource types in your stacks. CloudFormation needs appropriate permissions to call the read handlers for each resource type during drift detection. Since manual console operations work (likely using your user credentials with broader permissions), but the containerized version fails, this is a strong possibility.
- API Rate Limiting: When using thread pooling to process multiple stacks concurrently, you may be hitting AWS API rate limits more aggressively in the container environment. CloudFormation makes multiple API calls to various services during drift detection, and excessive concurrent operations can trigger throttling, leading to detection failures.
- Timeout Issues: While you've set a 30-second polling interval, the actual drift detection operation for each stack may take longer than expected, especially under concurrent load. The container environment might have different network characteristics or resource constraints compared to your local machine.
Recommendations
- Verify your ECS task role has all necessary read permissions for the resource types in your stacks
- Implement exponential backoff and retry logic for failed detections
- Consider reducing the level of concurrency in your thread pool to avoid rate limiting
- Monitor CloudWatch logs for specific error messages that might indicate the root cause
- Check the DetectionStatusReason field in the response, which provides details about why the detection failed
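The backoff and concurrency recommendations above can be sketched as follows. This is one possible shape, not a definitive implementation: `detect_one` stands in for a function like your `_detect_drift_and_poll` that returns a dict with a `drift_status` key, and the helper names, delay values, and worker count are assumptions. Retrying only helps when the failure is transient (for example throttling); a permissions problem will fail again on every attempt.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def backoff_delays(base=2.0, cap=60.0, attempts=6):
    """Yield exponentially growing delays with full jitter, capped at `cap` seconds."""
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))


def run_with_retry(detect_one, stack_name, sleep=time.sleep):
    """Re-run detect_one(stack_name) while it reports DETECTION_FAILED."""
    result = detect_one(stack_name)
    for delay in backoff_delays():
        if result.get("drift_status") != "DETECTION_FAILED":
            break
        sleep(delay)  # wait longer after each consecutive failure
        result = detect_one(stack_name)
    return result


def check_stacks(detect_one, stack_names, max_workers=3):
    """Process stacks with a small, fixed number of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda name: run_with_retry(detect_one, name), stack_names))
```

Keeping `max_workers` small limits how many drift detections run at once, which is the main lever against throttling; the jitter keeps retries from different threads from lining up.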
Sources
describe_stack_drift_detection_status - Boto3 1.42.50 documentation
describe_stack_drift_detection_status - botocore 1.42.50 documentation

Most likely it's timing out. Do you have error handling in your code? Log the errors you are receiving to CloudWatch Logs; they will tell you the reasons for the failure.