Based on your detailed description, this appears to be a classic case of a scheduled system-level process causing high CPU utilization across all your ECS tasks at exactly midnight UTC. Here are some potential causes and solutions:
- Log rotation or system maintenance: Even though you've checked for logrotate, there might be other system maintenance tasks running at midnight UTC. Many Linux distributions have default maintenance tasks scheduled at midnight, including log rotation, system updates, or cleanup processes.
- CloudWatch Logs agent: If your tasks are sending logs to CloudWatch, the CloudWatch Logs agent might be performing batch processing or log rotation at midnight, causing CPU spikes.
- Container resource competition: Since all three services run on the same EC2 instance, a resource-intensive process in one container could affect the others by competing for CPU.
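To check the first theory directly, you can look for jobs scheduled at 00:00 on the instance. A minimal sketch (paths vary by distribution, and some commands may need root):

```shell
# A crontab line starting with "0 0" means minute 0, hour 0 — i.e. midnight.
grep -h "^0 0" /etc/crontab /etc/cron.d/* 2>/dev/null || true

# Per-user cron entries and systemd timers with their next trigger times:
crontab -l 2>/dev/null || true
systemctl list-timers --all 2>/dev/null || true
```

Note that cron uses the system timezone, so a midnight-UTC spike only lines up with "0 0" entries if the instance clock is set to UTC (the default on Amazon Linux AMIs).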
To troubleshoot this issue:
- Check application logs: Review your application logs during the spike period to identify any patterns or errors that might indicate what's causing the high CPU usage.
- Monitor system-level metrics: Use commands like `top`, `sar`, or `vmstat` on the EC2 instance during the spike to identify which processes are consuming CPU. The `sar` command can be particularly useful for collecting CPU metrics at regular intervals.
- Set up alarms and detailed monitoring: Create CloudWatch alarms for metrics such as CPU utilization, memory usage, and application error rates to alert you when thresholds are exceeded.
- Implement Container Insights: Enable Container Insights for more detailed task- and container-level resource usage metrics, which can help pinpoint which specific container is causing the issue.
- Resource allocation adjustments: If the issue persists, consider:
- Horizontally scaling your tasks across more EC2 instances to distribute the load
- Vertically scaling by increasing the CPU units allocated in your task definitions
- Using a larger EC2 instance type with more CPU capacity
- Adjust health checks: If the spikes can't be eliminated, consider increasing your health check grace period so tasks aren't replaced during these predictable spikes.
- Optimize task startup: If your GPU-based tasks take time to start, ensure you have appropriate startup timeouts configured.
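If watching `top` live at midnight isn't practical, a small script scheduled a few minutes before the spike can snapshot the heaviest processes for later review. A minimal sketch (the output path and cron schedule are illustrative):

```shell
#!/bin/sh
# Snapshot the top CPU consumers to a timestamped file.
# Schedule just before midnight UTC, e.g. a cron entry like:
#   55 23 * * * /usr/local/bin/cpu-snapshot.sh    (path is illustrative)
OUT=/tmp/cpu-snapshot-$(date -u +%Y%m%dT%H%M%S).log
date -u > "$OUT"
# Top 10 processes by CPU share, plus the header line:
ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 11 >> "$OUT"
echo "wrote $OUT"
```

Comparing a snapshot from the spike window against one taken at a quiet time should make the offending process obvious, and `sar -o` can record system-wide counters over the same window for correlation.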
Since this happens at exactly midnight UTC and affects all services simultaneously, it strongly suggests a scheduled system process rather than an application-specific issue. Focus your investigation on system-level scheduled tasks and logging mechanisms first.
Sources
- Troubleshoot high CPU on Amazon ECS task | AWS re:Post
- Monitor high resource usage in ECS tasks and services | AWS re:Post
- ECS connect container unhealthy during new deployments to ECS EC2 | AWS re:Post
- Troubleshoot high CPU utilization | AWS re:Post