
ECS Tasks on EC2 experiencing daily CPU spike at exact same time causing health check failures


Question: I'm experiencing a recurring issue with my ECS cluster running on EC2 instances. Every day at exactly 00:00 UTC, all ECS tasks in the cluster simultaneously spike to 100% CPU usage for 6-8 minutes, causing ALB health check timeouts and automatic task replacement.

Environment:

  • Launch Type: EC2 (t3.medium instance)
  • Cluster: 3 services running on the same EC2 instance
  • Task Definitions: Different applications (Node.js/NestJS)
  • ALB Health Check: HTTP /ping endpoint, 10s timeout, 60s interval

EC2 Instance

Instance Type: t3.medium
vCPUs: 2
Memory: 4 GiB
Network: Up to 5 Gbps
EBS Bandwidth: Up to 2,085 Mbps
CPU Credits: Unlimited mode
Baseline Performance: 20% per vCPU
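Given these specs, a quick back-of-envelope credit check (using the 576-credit balance and the 6-8 minute spike reported later in the question) suggests credit exhaustion is not the bottleneck:

```python
# Rough CPU-credit arithmetic for the t3.medium during the reported spike.
# 1 CPU credit = 1 vCPU running at 100% for 1 minute; only usage above
# the baseline draws down the balance.
VCPUS = 2
BASELINE = 0.20          # 20% baseline per vCPU (t3.medium)
SPIKE_MINUTES = 8        # upper end of the observed 6-8 minute window
SPIKE_UTILIZATION = 1.0  # ~100% on both vCPUs
BANKED_CREDITS = 576     # balance reported in the question

net_burn = VCPUS * (SPIKE_UTILIZATION - BASELINE) * SPIKE_MINUTES
print(net_burn)  # 12.8 credits per spike -- nowhere near the 576 banked
```

So even in standard (non-unlimited) mode a single daily spike would barely dent the balance, consistent with the "CPU credits are sufficient" observation below.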
ECS Task Definitions
Task 1:
{
  "cpu": "512",      // 0.5 vCPU
  "memory": "1024",  // 1 GB
  "containerCpu": 512,
  "containerMemory": 1024
}
Task 2:
{
  "cpu": "512",      // 0.5 vCPU
  "memory": "512",   // 512 MB
  "containerCpu": 512,
  "containerMemory": 512
}
Task 3:
{
  "cpu": "512",      // 0.5 vCPU
  "memory": "1024",  // 1 GB
  "containerCpu": 512,
  "containerMemory": 1024
}

Rolling Update Configuration

Service 1
{
  "deploymentConfiguration": {
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "maximumPercent": 101,
    "minimumHealthyPercent": 0,
    "strategy": "ROLLING",
    "bakeTimeInMinutes": 0
  },
  "healthCheckGracePeriodSeconds": 0
}
Service 2
{
  "deploymentConfiguration": {
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "maximumPercent": 101,
    "minimumHealthyPercent": 0,
    "strategy": "ROLLING",
    "bakeTimeInMinutes": 0
  },
  "healthCheckGracePeriodSeconds": 0
}
Service 3
{
  "deploymentConfiguration": {
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "maximumPercent": 200,
    "minimumHealthyPercent": 100,
    "strategy": "ROLLING",
    "bakeTimeInMinutes": 0,
    "alarms": {
      "alarmNames": [],
      "rollback": false,
      "enable": false
    }
  },
  "healthCheckGracePeriodSeconds": 0
}

Total Allocation:

  • CPU: 1.5 vCPU allocated / 2 vCPU available (75% allocation)
  • Memory: 2.5 GB allocated / 4 GB available (62.5% allocation)
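As a sanity check on those allocation numbers, ECS expresses CPU in units where 1024 units = 1 vCPU:

```python
# CPU-unit arithmetic for the three task definitions above.
TASK_CPU_UNITS = [512, 512, 512]        # from the three task definitions
INSTANCE_VCPUS = 2                      # t3.medium

total_units = sum(TASK_CPU_UNITS)       # 1536
available_units = INSTANCE_VCPUS * 1024 # 2048
allocation_pct = 100 * total_units / available_units

print(total_units, available_units, allocation_pct)  # 1536 2048 75.0
```

Note that on the EC2 launch type these CPU units act as relative shares rather than hard caps, so when the instance has idle capacity each task can burst well above its 512 units; that is why all three tasks can show ~100% CPU at the same time.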

Observed Pattern: Daily at 00:00 UTC:

  • 00:00 - CPU usage jumps from 1% to 63-99%
  • 00:01-00:07 - All 3 tasks maintain 98-100% CPU
  • 00:01:54 - ALB health check timeout (Request timed out)
  • 00:02:04 - ECS stops unhealthy task
  • 00:07:47 - ECS starts new task
  • 00:08+ - CPU returns to normal (1-6%)
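The 00:01:54 failure time is consistent with the health check settings listed above. A rough sanity check, assuming an unhealthy threshold of 2 consecutive failures (the ALB default; the actual value is not stated in the question):

```python
# Does a timeout at 00:01:54 fit a 60s interval / 10s timeout health check?
INTERVAL_S = 60
TIMEOUT_S = 10
UNHEALTHY_THRESHOLD = 2  # assumption: ALB default, not given in the question

# Best case: the first check fires right as the spike starts.
earliest = (UNHEALTHY_THRESHOLD - 1) * INTERVAL_S + TIMEOUT_S  # 70s
# Worst case: a check just passed, so the first failing check is a full interval away.
latest = UNHEALTHY_THRESHOLD * INTERVAL_S + TIMEOUT_S          # 130s

observed = 1 * 60 + 54  # 00:01:54 after the 00:00 spike onset = 114s
print(earliest, latest, earliest <= observed <= latest)
```

In other words, the ALB is behaving exactly as configured; the spike itself, not the health check mechanism, is the anomaly.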

CloudWatch Metrics (ECS Service level):

  Time (UTC) | CPU Avg | CPU Max | Memory Avg | Memory Max
  00:00      | 63%     | 99%     | 12%        | 14%
  00:01      | 98%     | 100%    | 15%        | 16%
  00:02      | 99%     | 99%     | 17%        | 18%
  00:03      | 99%     | 100%    | 19%        | 20%
  00:08      | 6%      | 18%     | 5%         | 5%   ✅

ALB Metrics:

  • 00:02 - First request: 513ms response time
  • 00:03-00:04 - 44 requests (health check retries)
  • Normal response time: <10ms

Key Observations:

  • ✅ All 3 services (different applications) show identical CPU spike pattern

  • ✅ Happens exactly at 00:00 UTC every day

  • ✅ Memory usage is normal (max 25%)

  • ✅ EC2 instance CPU Credits are sufficient (576 credits)

  • ✅ Test cluster without ALB on same infrastructure: No issues (steady state every 6 hours)

  • ❌ No scheduled tasks/cron jobs at midnight UTC in applications

  • ❌ No EC2 scheduled events

  • ❌ No EventBridge rules or Auto Scaling scheduled actions

  • ❌ No logrotate or system maintenance at that time

Already Checked:

  • Application code: No cron jobs scheduled for 00:00 UTC
  • Winston logging: DailyRotateFile disabled (previously suspected)
  • EC2 instance: t3.medium, unlimited mode, CPU credits available
  • ECS Agent logs: Shows normal task stop/start commands from ACS
  • System logs: No cron/timer triggers at midnight
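Beyond eyeballing crontabs, a small helper can scan system cron files for anything firing at 00:00. This is a sketch, not a complete parser (`midnight_cron_entries` is a hypothetical name, and it deliberately ignores schedules that merely include midnight, e.g. `*/5 * * * *`):

```python
from pathlib import Path

def midnight_cron_entries(text: str) -> list[str]:
    """Return cron lines that fire exactly at 00:00 (simplistic check)."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith(("@daily", "@midnight")):
            hits.append(line)
            continue
        fields = line.split()
        # minute == 0 and hour == 0 (system crontabs: min hour dom mon dow user cmd)
        if len(fields) >= 6 and fields[0] in ("0", "00") and fields[1] in ("0", "00"):
            hits.append(line)
    return hits

if __name__ == "__main__":
    # Scan typical system cron locations on the EC2 instance.
    for path in [Path("/etc/crontab"), *Path("/etc/cron.d").glob("*")]:
        if path.is_file():
            for hit in midnight_cron_entries(path.read_text()):
                print(f"{path}: {hit}")
```

Running this on the container instance itself (and inside each container) would rule out any overlooked 00:00 entries; `systemctl list-timers --all` covers the systemd-timer side.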

Questions:

  • Is there any AWS ECS/EC2 internal maintenance or process that runs at midnight UTC?
  • Could ALB health checks themselves cause this CPU spike pattern?
  • Are there any hidden scheduled actions at the ECS control plane level?
  • Why would all tasks spike CPU simultaneously at the exact same time daily?

Additional Context:

  • This pattern started occurring daily without any deployment changes
  • Previously had OOM issues (fixed by disabling file logging)
  • Current issue is CPU-based, not memory
  • Same codebase in test cluster (without ALB) runs fine without daily spikes

Any insights into what could cause such a precise, recurring CPU spike pattern would be greatly appreciated!

1 Answer

Based on your detailed description, this appears to be a classic case of a scheduled system-level process causing high CPU utilization across all your ECS tasks at exactly midnight UTC. Here are some potential causes and solutions:

  1. Log rotation or system maintenance: Even though you've checked for logrotate, there might be other system maintenance tasks running at midnight UTC. Many Linux distributions have default maintenance tasks scheduled at midnight, including log rotation, system updates, or cleanup processes.

  2. CloudWatch Logs agent: If your tasks are sending logs to CloudWatch, the CloudWatch Logs agent might be performing batch processing or log rotation at midnight, causing CPU spikes.

  3. Container resource competition: Since all three services run on the same EC2 instance, a resource-intensive process in one container could affect others by competing for CPU resources.

To troubleshoot this issue:

  1. Check application logs: Review your application logs during the spike period to identify any patterns or errors that might indicate what's causing the high CPU usage.

  2. Monitor system-level metrics: Use commands like top, sar, or vmstat on the EC2 instance during the spike to identify which processes are consuming CPU. The sar command can be particularly useful for collecting CPU metrics at regular intervals.
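If installing tools on the instance is undesirable, a minimal stdlib sketch like the following can log overall CPU utilization around midnight with nothing but Python (Linux-only; it reads `/proc/stat`, the same counters `top` and `sar` use):

```python
import time

def cpu_utilization(interval: float = 1.0) -> float:
    """Overall CPU utilization (%) from two /proc/stat samples. Linux-only."""
    def snapshot():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]  # idle + iowait jiffies
        return idle, sum(fields)

    idle1, total1 = snapshot()
    time.sleep(interval)
    idle2, total2 = snapshot()
    delta = total2 - total1
    busy = delta - (idle2 - idle1)
    return 100.0 * busy / delta if delta > 0 else 0.0

if __name__ == "__main__":
    print(f"{cpu_utilization():.1f}%")
```

Pairing this (or `ps -eo pid,pcpu,comm --sort=-pcpu`) with a cron entry a minute before midnight would capture which processes own the spike as it happens.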

  3. Set up alarms and detailed monitoring: Create CloudWatch alarms for metrics such as CPU utilization, memory usage, and application error rates to alert you when thresholds are exceeded.

  4. Implement Container Insights: Enable Container Insights for more detailed task and container level resource usage metrics, which can help pinpoint which specific container is causing the issue.

  5. Resource allocation adjustments: If the issue persists, consider:

  • Horizontally scaling your tasks across more EC2 instances to distribute the load
  • Vertically scaling by increasing the CPU units allocated in your task definitions
  • Using a larger EC2 instance type with more CPU capacity
  6. Implement health check adjustments: All three services set healthCheckGracePeriodSeconds to 0; raising it protects newly started tasks during startup, and relaxing the ALB health check timeout or unhealthy threshold can prevent replacement of already-running tasks during these predictable spikes if they can't be eliminated.
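If the spike can't be eliminated, the grace period change might look like this in the service definitions shown in the question (600 seconds is an illustrative value, not a recommendation):

```json
{
  "deploymentConfiguration": {
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "maximumPercent": 101,
    "minimumHealthyPercent": 0,
    "strategy": "ROLLING"
  },
  "healthCheckGracePeriodSeconds": 600
}
```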

  7. Implement task startup optimization: If your tasks perform CPU-heavy initialization at startup, ensure you have appropriate startup timeouts configured.

Since this happens at exactly midnight UTC and affects all services simultaneously, it strongly suggests a scheduled system process rather than an application-specific issue. Focus your investigation on system-level scheduled tasks and logging mechanisms first.
Sources:

  • Troubleshoot high CPU on Amazon ECS task | AWS re:Post
  • Monitor high resource usage in ECS tasks and services | AWS re:Post
  • ECS connect container unhealthy during new deployments to ECS EC2 | AWS re:Post
  • Troubleshoot high CPU utilization | AWS re:Post

answered 2 months ago by AWS (EXPERT), reviewed 2 months ago
