Automated Network Incident Response: Using Network Flow Monitor to Trigger AWS DevOps Agent Investigations
This article shows how to automatically trigger an AWS DevOps Agent investigation when Amazon CloudWatch Network Flow Monitor (NFM) detects network degradation. When NFM reports connection timeouts between your workloads, a CloudWatch alarm fires, an EventBridge rule routes the state change to a Lambda function, and the function sends an HMAC-authenticated webhook to the DevOps Agent, which creates an investigation without manual intervention.
The Problem
In modern cloud architectures, east-west traffic between application tiers traverses multiple network components: Transit Gateways, centralized inspection VPCs with Network Firewalls, and complex routing tables. When something breaks in this path, the failures are often intermittent rather than complete. A misconfigured firewall rule might block one Availability Zone while the other keeps working. Users see random timeouts, but health checks still pass.
These issues are hard to detect because TCP retransmissions happen silently at the network layer. The application retries, some requests succeed on alternate paths, and nobody notices until the problem compounds. When an engineer finally spots it, tracing the issue through TGW route tables, firewall rules, NACLs, and security groups takes time.
The question this article answers: how do you detect network-layer degradation in real time and start investigating immediately, before a human even knows there's a problem?
The Solution
NFM Agent (on EKS nodes)
→ CloudWatch Metric: Timeouts
→ CloudWatch Alarm (Timeouts > threshold)
→ EventBridge Rule
→ Lambda Function (HMAC webhook)
→ AWS DevOps Agent
→ Investigation auto-created
Prerequisites
- An existing workload with Network Flow Monitor configured (agent installed, monitor created)
- An AWS DevOps Agent space with a Generic (HMAC) webhook configured
- Familiarity with CloudWatch Alarms, EventBridge, and Lambda
Step 1: Identify the Correct NFM Metric
NFM publishes metrics under the AWS/NetworkFlowMonitor namespace. The key metrics for detecting degradation are:
| Metric | Description | Use Case |
|---|---|---|
| Timeouts | Connection timeout count | Detect blocked or unreachable paths |
| Retransmissions | TCP retransmission count | Detect packet loss or congestion |
| RoundTripTime | Network latency | Detect latency spikes |
| HealthIndicator | Overall health (0=healthy, 1=degraded) | Broad health monitoring |
Important: The dimension is MonitorId with the full monitor ARN as the value, not the monitor name.
    # Verify your monitor is publishing metrics
    aws cloudwatch list-metrics \
      --namespace AWS/NetworkFlowMonitor \
      --region us-east-1
Example output:
    {
      "Namespace": "AWS/NetworkFlowMonitor",
      "MetricName": "Timeouts",
      "Dimensions": [
        {
          "Name": "MonitorId",
          "Value": "arn:aws:networkflowmonitor:us-east-1:123456789012:monitor/my-monitor"
        }
      ]
    }
Note: NFM takes 10-15 minutes after monitor creation to start publishing metrics. If you don't see metrics, ensure the NFM agent pods are running and traffic is flowing through the monitored path.
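The dimension check described above can also be done programmatically. The following is a minimal sketch that parses `list-metrics` JSON output and tests whether each MonitorId value looks like a full ARN. The helper names (`monitor_ids`, `is_full_arn`) and the sample payload are illustrative assumptions, and the ARN prefix assumes the standard `aws` partition.

```python
import json

# Sample `aws cloudwatch list-metrics` output (illustrative values only).
SAMPLE_OUTPUT = """
{
  "Metrics": [
    {
      "Namespace": "AWS/NetworkFlowMonitor",
      "MetricName": "Timeouts",
      "Dimensions": [
        {"Name": "MonitorId",
         "Value": "arn:aws:networkflowmonitor:us-east-1:123456789012:monitor/my-monitor"}
      ]
    }
  ]
}
"""

# Assumption: standard "aws" partition; adjust for other partitions.
ARN_PREFIX = "arn:aws:networkflowmonitor:"


def monitor_ids(list_metrics_json, metric_name="Timeouts"):
    """Return the MonitorId dimension values published for one metric."""
    metrics = json.loads(list_metrics_json).get("Metrics", [])
    values = []
    for metric in metrics:
        if metric.get("MetricName") != metric_name:
            continue
        for dim in metric.get("Dimensions", []):
            if dim.get("Name") == "MonitorId":
                values.append(dim["Value"])
    return values


def is_full_arn(value):
    """True when the value looks like a monitor ARN rather than a bare name."""
    return value.startswith(ARN_PREFIX) and ":monitor/" in value


if __name__ == "__main__":
    for value in monitor_ids(SAMPLE_OUTPUT):
        print(value, "ok" if is_full_arn(value) else "not an ARN")
```

Copying the exact value returned here into the alarm's `--dimensions` argument avoids the most common cause of an alarm that never receives data.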
Step 2: Create the CloudWatch Alarm
Create an alarm that fires when timeouts exceed your threshold:
    aws cloudwatch put-metric-alarm \
      --alarm-name NFM-Alarm-EKS-DB \
      --alarm-description "Network degradation detected between EKS and RDS" \
      --namespace AWS/NetworkFlowMonitor \
      --metric-name Timeouts \
      --dimensions Name=MonitorId,Value=arn:aws:networkflowmonitor:us-east-1:123456789012:monitor/my-monitor \
      --statistic Sum \
      --period 300 \
      --evaluation-periods 1 \
      --threshold 50 \
      --comparison-operator GreaterThanThreshold \
      --treat-missing-data notBreaching
Choose your threshold based on your baseline. You can check current values with the command below. The -v-1H flag is BSD/macOS date syntax; with GNU date on Linux, use date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ for the start time.
    aws cloudwatch get-metric-statistics \
      --namespace AWS/NetworkFlowMonitor \
      --metric-name Timeouts \
      --dimensions Name=MonitorId,Value=arn:aws:networkflowmonitor:us-east-1:123456789012:monitor/my-monitor \
      --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 300 \
      --statistics Sum
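One way to turn baseline datapoints into a threshold is sketched below. The heuristic (mean plus three standard deviations, with a minimum floor) is an assumption chosen for illustration, not AWS guidance, and the function name `suggest_threshold` is hypothetical. The input is the list of per-period Sum values returned by `get-metric-statistics`.

```python
import statistics


def suggest_threshold(baseline_sums, sigmas=3.0, floor=10.0):
    """Suggest an alarm threshold from per-period Sum datapoints.

    Heuristic (an assumption, not AWS guidance): mean plus `sigmas`
    population standard deviations, never below `floor`.
    """
    if not baseline_sums:
        # No baseline yet: fall back to the floor value.
        return floor
    mean = statistics.fmean(baseline_sums)
    spread = statistics.pstdev(baseline_sums)
    return max(floor, mean + sigmas * spread)


if __name__ == "__main__":
    # Hypothetical 5-minute Sum values from a quiet hour.
    quiet_hour = [4, 6, 4, 6, 5, 5, 3, 7, 5, 5, 4, 6]
    print(suggest_threshold(quiet_hour))
```

The floor keeps a near-zero baseline from producing a threshold so low that a handful of ordinary timeouts would open an investigation.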
Step 3: Configure the DevOps Agent Webhook
- Open the AWS DevOps Agent console
- Navigate to your Agent Space → Capabilities → Webhooks
- Click Add and select Generic (HMAC)
- Save the generated Webhook URL and Secret Key
Step 4: Create the Lambda Function
Create a Lambda function (Python 3.12, 30s timeout) that receives the alarm event and sends an HMAC-signed webhook to the DevOps Agent.
Lambda code:
    import base64
    import hashlib
    import hmac
    import json
    import os
    from datetime import datetime, timezone

    import boto3
    import urllib3

    http = urllib3.PoolManager()
    SECRET_ARN = os.environ.get('SECRET_ARN')

    # Cache the secret across warm invocations
    _cached_secret = None


    def get_webhook_config():
        """Load the webhook URL and secret from Secrets Manager (once per cold start)."""
        global _cached_secret
        if _cached_secret is None:
            client = boto3.client('secretsmanager')
            response = client.get_secret_value(SecretId=SECRET_ARN)
            _cached_secret = json.loads(response['SecretString'])
        return _cached_secret


    def utc_timestamp():
        """Current UTC time as %Y-%m-%dT%H:%M:%S.000Z (datetime.utcnow() is deprecated in 3.12)."""
        return datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.000Z')


    def lambda_handler(event, context):
        print(f"Received event: {json.dumps(event)}")

        detail = event.get('detail', {})
        alarm_name = detail.get('alarmName', 'Unknown')
        state = detail.get('state', {})
        new_state = state.get('value', 'ALARM')
        reason = state.get('reason', '')
        timestamp = state.get('timestamp', utc_timestamp())
        region = event.get('region', 'us-east-1')
        account_id = event.get('account', '')

        # Only ALARM transitions should open an investigation
        if new_state != 'ALARM':
            return {'statusCode': 200, 'body': 'Not ALARM state, skipping'}

        # Pull the metric namespace, name, and dimensions out of the alarm configuration
        config = detail.get('configuration', {})
        metrics = config.get('metrics', [])
        metric_info = ""
        if metrics:
            metric = metrics[0].get('metricStat', {}).get('metric', {})
            metric_info = f"\nMetric: {metric.get('namespace', '')}/{metric.get('name', '')}"
            dims = metric.get('dimensions', {})
            if dims:
                metric_info += f"\nDimensions: {json.dumps(dims)}"

        description = f"CloudWatch Alarm: {alarm_name}\n"
        description += f"AWS Account: {account_id}\nRegion: {region}\n"
        description += f"State: {new_state}\nReason: {reason}"
        description += metric_info

        # Tolerate a missing or empty resources list
        resources = event.get('resources') or ['']

        payload = {
            "eventType": "incident",
            "incidentId": f"{alarm_name}-{timestamp}",
            "action": "created",
            "priority": "HIGH",
            "title": f"CloudWatch Alarm: {alarm_name}",
            "description": description,
            "timestamp": timestamp,
            "service": alarm_name,
            "data": {
                "metadata": {
                    "alarmName": alarm_name,
                    "region": region,
                    "accountId": account_id,
                    "newState": new_state,
                    "reason": reason,
                    "alarmArn": resources[0],
                    "metrics": metrics
                }
            }
        }
        payload_json = json.dumps(payload)
        event_timestamp = utc_timestamp()

        # Retrieve webhook credentials from Secrets Manager
        webhook_config = get_webhook_config()
        webhook_url = webhook_config['WEBHOOK_URL']
        webhook_secret = webhook_config['WEBHOOK_SECRET']

        # Generate HMAC signature (timestamp:payload format)
        signature_string = f"{event_timestamp}:{payload_json}"
        signature = hmac.new(
            webhook_secret.encode('utf-8'),
            signature_string.encode('utf-8'),
            hashlib.sha256
        ).digest()
        signature_b64 = base64.b64encode(signature).decode('utf-8')

        headers = {
            'Content-Type': 'application/json',
            'x-amzn-event-timestamp': event_timestamp,
            'x-amzn-event-signature': signature_b64
        }

        response = http.request('POST', webhook_url, body=payload_json, headers=headers)
        print(f"Webhook response: {response.status}")

        if response.status not in (200, 202):
            # Log the body so failures can be diagnosed from CloudWatch Logs
            print(f"Webhook response body: {response.data.decode('utf-8', errors='replace')}")
            raise Exception(f"Webhook failed: {response.status}")
        return {'statusCode': 200, 'body': 'Investigation triggered'}
Environment variable:
- SECRET_ARN: ARN of the Secrets Manager secret
Store the webhook credentials in Secrets Manager:
    aws secretsmanager create-secret \
      --name devops-agent-webhook \
      --secret-string '{"WEBHOOK_URL":"<your-webhook-url>","WEBHOOK_SECRET":"<your-hmac-secret>"}'
Lambda IAM permissions: The execution role needs secretsmanager:GetSecretValue:
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:<REGION>:<ACCOUNT_ID>:secret:devops-agent-webhook-*"
    }
The secret is cached in memory across warm Lambda invocations, so Secrets Manager is only called once per cold start.
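The signing scheme used by the Lambda function (HMAC-SHA256 over "timestamp:payload", base64-encoded) can be exercised locally before deploying. The sketch below mirrors that scheme with the Python standard library only. How the DevOps Agent verifies the signature on its side is not described in this article, so the `verify` function is an illustration of the general technique, and the secret and timestamp values are placeholders.

```python
import base64
import hashlib
import hmac
import json


def sign(secret, timestamp, payload_json):
    """Base64 HMAC-SHA256 over "<timestamp>:<payload>", as in the Lambda code."""
    message = f"{timestamp}:{payload_json}".encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), message, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("utf-8")


def verify(secret, timestamp, payload_json, signature_b64):
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret, timestamp, payload_json)
    return hmac.compare_digest(expected, signature_b64)


if __name__ == "__main__":
    secret = "example-secret"  # placeholder, not a real key
    ts = "2025-01-01T00:00:00.000Z"
    body = json.dumps({"eventType": "incident", "title": "test"})
    sig = sign(secret, ts, body)
    # Same bytes that were signed: the signature matches.
    print(verify(secret, ts, body, sig))
    # One extra character in the body: the signature no longer matches.
    print(verify(secret, ts, body + " ", sig))
```

The second check shows why the request body must be the exact string that was signed: re-serializing the payload between signing and sending would change the bytes and invalidate the signature.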
Step 5: Create the EventBridge Rule
Create a rule that triggers the Lambda when the alarm enters ALARM state:
    aws events put-rule \
      --name nfm-alarm-to-investigation \
      --event-pattern '{
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
          "alarmName": ["NFM-Alarm-EKS-DB"],
          "state": {"value": ["ALARM"]}
        }
      }' \
      --state ENABLED
Add the Lambda as the target:
    aws events put-targets \
      --rule nfm-alarm-to-investigation \
      --targets '[{"Id":"lambda","Arn":"arn:aws:lambda:us-east-1:123456789012:function:devops-agent-webhook"}]'
Allow EventBridge to invoke the Lambda:
    aws lambda add-permission \
      --function-name devops-agent-webhook \
      --statement-id eventbridge-invoke \
      --action lambda:InvokeFunction \
      --principal events.amazonaws.com \
      --source-arn arn:aws:events:us-east-1:123456789012:rule/nfm-alarm-to-investigation
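To reason about which events the rule selects, the event pattern can be evaluated locally against a sample event. The matcher below is a deliberately simplified sketch: it handles only exact-value lists and nested objects, not the full EventBridge pattern language (prefix, numeric, anything-but, and so on). The `matches` and `sample_event` names are hypothetical, and the sample event carries only the fields the pattern inspects. For an authoritative check, the `aws events test-event-pattern` command evaluates a pattern with the real matching rules.

```python
PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["NFM-Alarm-EKS-DB"],
        "state": {"value": ["ALARM"]},
    },
}


def matches(pattern, event):
    """Simplified matcher: exact-value lists and nested objects only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching sub-object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif isinstance(actual, list):
            # Array field: any element may match one of the allowed values.
            if not any(item in expected for item in actual):
                return False
        elif actual not in expected:
            return False
    return True


def sample_event(alarm_name, state_value):
    """Minimal alarm state change event with only the fields the pattern reads."""
    return {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "region": "us-east-1",
        "account": "123456789012",
        "detail": {"alarmName": alarm_name, "state": {"value": state_value}},
    }


if __name__ == "__main__":
    print(matches(PATTERN, sample_event("NFM-Alarm-EKS-DB", "ALARM")))
    print(matches(PATTERN, sample_event("NFM-Alarm-EKS-DB", "OK")))
    print(matches(PATTERN, sample_event("nfm-alarm-eks-db", "ALARM")))
```

The last case illustrates that the comparison is case-sensitive: an alarm name that differs only in capitalization does not match, which is a common reason for a rule that never fires.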
Step 6: Test the Pipeline
Force the alarm to transition from OK to ALARM:
    # Set to OK first
    aws cloudwatch set-alarm-state \
      --alarm-name NFM-Alarm-EKS-DB \
      --state-value OK \
      --state-reason "Reset for testing"

    # Wait a few seconds, then trigger
    aws cloudwatch set-alarm-state \
      --alarm-name NFM-Alarm-EKS-DB \
      --state-value ALARM \
      --state-reason "Testing: NFM Timeouts exceeded threshold"
Within seconds, check:
- Lambda logs (/aws/lambda/devops-agent-webhook): should show webhook response 200/202
- DevOps Agent console: a new investigation should appear automatically
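The log check can be scripted once the log text has been exported. The sketch below scans for the "Webhook response: <status>" line that the Lambda function prints and reports whether the most recent call was accepted. The function names and sample log lines are hypothetical, and fetching the log text from CloudWatch Logs is left out.

```python
import re

# Matches the line printed by the Lambda function, e.g. "Webhook response: 202".
STATUS_RE = re.compile(r"Webhook response: (\d{3})")


def webhook_statuses(log_text):
    """Return every webhook HTTP status found in the log text, in order."""
    return [int(m.group(1)) for m in STATUS_RE.finditer(log_text)]


def delivered(log_text):
    """True when the most recent webhook call returned 200 or 202."""
    statuses = webhook_statuses(log_text)
    return bool(statuses) and statuses[-1] in (200, 202)


if __name__ == "__main__":
    # Hypothetical log excerpt for illustration.
    sample_log = "\n".join([
        "START RequestId: 1111",
        "Webhook response: 403",
        "START RequestId: 2222",
        "Webhook response: 202",
    ])
    print(webhook_statuses(sample_log))
    print(delivered(sample_log))
```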
Result - Autonomous Investigation Before a Human Even Joins
The key value is time. Traditionally, an alarm fires, an engineer is paged, opens a laptop, and starts investigating from scratch; 40 to 70 minutes can pass before the root cause is found. With this automation, the DevOps Agent begins investigating within seconds of the alarm firing. By the time the on-call engineer opens their laptop, the investigation is already in progress or complete, with a root cause identified and a mitigation plan ready.
AWS DevOps Agent correlates metrics, logs, deployment history, and network flow data automatically. It identifies issues stemming from system changes, resource limits, component failures, and dependency issues. Once root cause is identified, it provides detailed mitigation plans including actions to resolve, validate, and revert if needed. It also correlates related alarms to determine if they originate from the same event - reducing noise so teams focus on what matters.
The engineer's role shifts from "investigate from scratch" to "review findings and approve the fix." MTTR can drop from an hour or more to minutes.
What DevOps Agent Investigates
When triggered by the NFM alarm, DevOps Agent will:
- Analyze CloudWatch metrics (timeouts, retransmissions, round-trip time)
- Correlate with Network Firewall logs for blocked traffic
- Review recent infrastructure changes (route tables, security groups, firewall rules)
- Check deployment history for recent code or configuration changes
- Identify root cause and provide a mitigation plan
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| No metrics in AWS/NetworkFlowMonitor | NFM agent not running or monitor too new | Verify agent pods are running; wait 15 min after creation |
| Alarm stays in INSUFFICIENT_DATA | Wrong dimension format | Use MonitorId with full ARN, not monitor name |
| Lambda not invoked | EventBridge rule not matching | Verify alarm name in event pattern matches exactly |
| Webhook returns 401/403 | HMAC signature mismatch | Verify secret key; ensure timestamp format is %Y-%m-%dT%H:%M:%S.000Z |
| Investigation not created | Webhook URL incorrect | Verify URL from DevOps Agent console; check Lambda logs for response body |
Summary
By connecting Network Flow Monitor metrics to AWS DevOps Agent through CloudWatch Alarms and EventBridge, you create a fully automated incident response pipeline. Network degradation is detected within minutes, and an AI-powered investigation begins immediately, without waiting for a human to notice the problem.
