Automated Network Incident Response: Using Network Flow Monitor to Trigger AWS DevOps Agent Investigations

8 minute read
Content level: Advanced

This article shows how to automatically trigger an AWS DevOps Agent investigation when Amazon CloudWatch Network Flow Monitor (NFM) detects network degradation. When NFM reports connection timeouts between your workloads, a CloudWatch Alarm fires, an EventBridge rule matches the state change, and a Lambda function sends an HMAC-authenticated webhook to the DevOps Agent, which creates an investigation with no manual intervention.

The Problem

In modern cloud architectures, east-west traffic between application tiers traverses multiple network components: Transit Gateways, centralized inspection VPCs with Network Firewalls, and complex routing tables. When something breaks in this path, the failures are often intermittent rather than complete. A misconfigured firewall rule might block one Availability Zone while the other keeps working. Users see random timeouts, but health checks still pass.

These issues are hard to detect because TCP retransmissions happen silently at the network layer. The application retries, some requests succeed on alternate paths, and nobody notices until the problem compounds. When an engineer finally spots it, tracing the issue through TGW route tables, firewall rules, NACLs, and security groups takes time.

The question this article answers: how do you detect network-layer degradation in near real time and start investigating immediately, before a human even knows there's a problem?

The Solution

NFM Agent (on EKS nodes)
    → CloudWatch Metric: Timeouts
        → CloudWatch Alarm (Timeouts > threshold)
            → EventBridge Rule
                → Lambda Function (HMAC webhook)
                    → AWS DevOps Agent
                        → Investigation auto-created

Prerequisites

  • An existing workload with Network Flow Monitor configured (agent installed, monitor created)
  • An AWS DevOps Agent space with a Generic (HMAC) webhook configured
  • Familiarity with CloudWatch Alarms, EventBridge, and Lambda

Step 1: Identify the Correct NFM Metric

NFM publishes metrics under the AWS/NetworkFlowMonitor namespace. The key metrics for detecting degradation are:

Metric          | Description                            | Use Case
Timeouts        | Connection timeout count               | Detect blocked or unreachable paths
Retransmissions | TCP retransmission count               | Detect packet loss or congestion
RoundTripTime   | Network latency                        | Detect latency spikes
HealthIndicator | Overall health (0=healthy, 1=degraded) | Broad health monitoring

Important: The dimension is MonitorId with the full monitor ARN as the value, not the monitor name.

# Verify your monitor is publishing metrics
aws cloudwatch list-metrics \
  --namespace AWS/NetworkFlowMonitor \
  --region us-east-1

Example output:

{
  "Namespace": "AWS/NetworkFlowMonitor",
  "MetricName": "Timeouts",
  "Dimensions": [
    {
      "Name": "MonitorId",
      "Value": "arn:aws:networkflowmonitor:us-east-1:123456789012:monitor/my-monitor"
    }
  ]
}

Note: NFM takes 10-15 minutes after monitor creation to start publishing metrics. If you don't see metrics, ensure the NFM agent pods are running and traffic is flowing through the monitored path.
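
The dimension note above is easy to get wrong in scripts, so a small guard can help. The following is a minimal sketch, not part of the original setup: `monitor_dimension` is a hypothetical helper, and the ARN pattern is an assumption inferred from the example output above (it only covers the standard `aws` partition).

```python
import re

# Assumed ARN shape, inferred from the example output above:
#   arn:aws:networkflowmonitor:<region>:<12-digit account>:monitor/<name>
# Other partitions (for example aws-cn) are not covered by this sketch.
MONITOR_ARN_RE = re.compile(
    r"^arn:aws:networkflowmonitor:[a-z0-9-]+:\d{12}:monitor/.+$"
)


def monitor_dimension(value: str) -> dict:
    """Return a CloudWatch dimension dict, rejecting bare monitor names."""
    if not MONITOR_ARN_RE.match(value):
        raise ValueError(
            f"MonitorId must be the full monitor ARN, got: {value!r}"
        )
    return {"Name": "MonitorId", "Value": value}
```

Passing a bare name such as `my-monitor` raises `ValueError` instead of silently creating an alarm on a metric that never receives data.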

Step 2: Create the CloudWatch Alarm

Create an alarm that fires when timeouts exceed your threshold:

aws cloudwatch put-metric-alarm \
  --alarm-name NFM-Alarm-EKS-DB \
  --alarm-description "Network degradation detected between EKS and RDS" \
  --namespace AWS/NetworkFlowMonitor \
  --metric-name Timeouts \
  --dimensions Name=MonitorId,Value=arn:aws:networkflowmonitor:us-east-1:123456789012:monitor/my-monitor \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching

Choose your threshold based on your baseline. You can check current values:

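# Note: -v-1H is BSD/macOS date syntax. On GNU/Linux, use:
#   date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \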
  --namespace AWS/NetworkFlowMonitor \
  --metric-name Timeouts \
  --dimensions Name=MonitorId,Value=arn:aws:networkflowmonitor:us-east-1:123456789012:monitor/my-monitor \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 --statistics Sum
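
One way to turn those baseline values into a threshold is sketched below. This is an illustrative heuristic, not a recommendation from the NFM documentation: `suggest_threshold`, the multiplier `k`, and the `floor` value are all assumptions, and the input is the list of `Sum` datapoints returned by the command above.

```python
import math
from statistics import mean, pstdev


def suggest_threshold(sums, k=3.0, floor=10):
    """Suggest an alarm threshold as baseline mean + k standard deviations.

    sums  -- per-period Sum values of the Timeouts metric (the baseline)
    k     -- how many standard deviations above the mean to allow (assumed)
    floor -- minimum threshold, so a near-zero baseline does not alarm on
             a handful of timeouts (assumed value, tune for your workload)
    """
    if not sums:
        return floor
    return max(floor, math.ceil(mean(sums) + k * pstdev(sums)))
```

A quiet baseline falls back to the floor, while a noisy one pushes the threshold above normal variation; either way, the result is a starting point to review against real incidents.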

Step 3: Configure the DevOps Agent Webhook

  1. Open the AWS DevOps Agent console
  2. Navigate to your Agent Space → Capabilities → Webhooks
  3. Click Add and select Generic (HMAC)
  4. Save the generated Webhook URL and Secret Key

Step 4: Create the Lambda Function

Create a Lambda function (Python 3.12, 30s timeout) that receives the alarm event and sends an HMAC-signed webhook to the DevOps Agent.

Lambda code:

import base64
import hashlib
import hmac
import json
import os
from datetime import datetime, timezone

import boto3
import urllib3

http = urllib3.PoolManager()
SECRET_ARN = os.environ.get('SECRET_ARN')

# Cache secret across warm invocations
_cached_secret = None

def get_webhook_config():
    """Return the webhook URL and HMAC secret, fetched once per cold start."""
    global _cached_secret
    if _cached_secret is None:
        client = boto3.client('secretsmanager')
        response = client.get_secret_value(SecretId=SECRET_ARN)
        _cached_secret = json.loads(response['SecretString'])
    return _cached_secret

def lambda_handler(event, context):
    print(f"Received event: {json.dumps(event)}")

    # Extract the alarm fields from the CloudWatch Alarm State Change event
    detail = event.get('detail', {})
    alarm_name = detail.get('alarmName', 'Unknown')
    state = detail.get('state', {})
    new_state = state.get('value', 'ALARM')
    reason = state.get('reason', '')
    # datetime.utcnow() is deprecated as of Python 3.12; use an aware UTC time
    timestamp = state.get('timestamp', datetime.now(timezone.utc).isoformat())
    region = event.get('region', 'us-east-1')
    account_id = event.get('account', '')

    # The EventBridge rule already filters on ALARM; this is a second guard
    if new_state != 'ALARM':
        return {'statusCode': 200, 'body': 'Not ALARM state, skipping'}

    # Summarize the first metric behind the alarm for the description
    config = detail.get('configuration', {})
    metrics = config.get('metrics', [])
    metric_info = ""
    if metrics:
        metric = metrics[0].get('metricStat', {}).get('metric', {})
        metric_info = f"\nMetric: {metric.get('namespace', '')}/{metric.get('name', '')}"
        dims = metric.get('dimensions', {})
        if dims:
            metric_info += f"\nDimensions: {json.dumps(dims)}"

    description = f"CloudWatch Alarm: {alarm_name}\n"
    description += f"AWS Account: {account_id}\nRegion: {region}\n"
    description += f"State: {new_state}\nReason: {reason}"
    description += metric_info

    payload = {
        "eventType": "incident",
        "incidentId": f"{alarm_name}-{timestamp}",
        "action": "created",
        "priority": "HIGH",
        "title": f"CloudWatch Alarm: {alarm_name}",
        "description": description,
        "timestamp": timestamp,
        "service": alarm_name,
        "data": {
            "metadata": {
                "alarmName": alarm_name,
                "region": region,
                "accountId": account_id,
                "newState": new_state,
                "reason": reason,
                # Guard against a missing or empty resources list
                "alarmArn": (event.get('resources') or [''])[0],
                "metrics": metrics
            }
        }
    }

    payload_json = json.dumps(payload)
    event_timestamp = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.000Z')

    # Retrieve webhook credentials from Secrets Manager
    webhook_config = get_webhook_config()
    webhook_url = webhook_config['WEBHOOK_URL']
    webhook_secret = webhook_config['WEBHOOK_SECRET']

    # Generate HMAC signature (timestamp:payload format)
    signature_string = f"{event_timestamp}:{payload_json}"
    signature = hmac.new(
        webhook_secret.encode('utf-8'),
        signature_string.encode('utf-8'),
        hashlib.sha256
    ).digest()
    signature_b64 = base64.b64encode(signature).decode('utf-8')

    headers = {
        'Content-Type': 'application/json',
        'x-amzn-event-timestamp': event_timestamp,
        'x-amzn-event-signature': signature_b64
    }

    # Send exactly the bytes that were signed; keep the request well under
    # the 30s Lambda timeout
    response = http.request(
        'POST', webhook_url,
        body=payload_json.encode('utf-8'),
        headers=headers,
        timeout=10.0
    )
    # Log the body as well as the status to make webhook failures diagnosable
    body_text = response.data.decode('utf-8', errors='replace')
    print(f"Webhook response: {response.status} {body_text}")

    if response.status in [200, 202]:
        return {'statusCode': 200, 'body': 'Investigation triggered'}
    else:
        raise Exception(f"Webhook failed: {response.status}")

Environment variable:

  • SECRET_ARN - ARN of the Secrets Manager secret

Store the webhook credentials in Secrets Manager:

aws secretsmanager create-secret \
  --name devops-agent-webhook \
  --secret-string '{"WEBHOOK_URL":"<your-webhook-url>","WEBHOOK_SECRET":"<your-hmac-secret>"}'

Lambda IAM permissions: The execution role needs secretsmanager:GetSecretValue:

{
  "Effect": "Allow",
  "Action": "secretsmanager:GetSecretValue",
  "Resource": "arn:aws:secretsmanager:<REGION>:<ACCOUNT_ID>:secret:devops-agent-webhook-*"
}

The secret is cached in memory across warm Lambda invocations, so Secrets Manager is only called once per cold start.
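
The signing step in the Lambda code above can also be isolated as a pure function, which makes it easy to unit test without AWS access. The sketch below mirrors the `timestamp:payload` scheme used in this article; `sign` and `verify` are hypothetical helper names, and `verify` only illustrates how a receiver could check such a signature. It is not the DevOps Agent's documented implementation.

```python
import base64
import hashlib
import hmac


def sign(secret: str, timestamp: str, body: str) -> str:
    """Return the base64 HMAC-SHA256 of 'timestamp:body', as in the Lambda."""
    message = f"{timestamp}:{body}".encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), message, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("utf-8")


def verify(secret: str, timestamp: str, body: str, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    return hmac.compare_digest(sign(secret, timestamp, body), signature)
```

Because the signature covers the exact body string, the request must send the same serialized JSON that was signed; re-serializing the payload after signing would change the bytes and invalidate the signature.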

Step 5: Create the EventBridge Rule

Create a rule that triggers the Lambda when the alarm enters ALARM state:

aws events put-rule \
  --name nfm-alarm-to-investigation \
  --event-pattern '{
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
      "alarmName": ["NFM-Alarm-EKS-DB"],
      "state": {"value": ["ALARM"]}
    }
  }' \
  --state ENABLED

Add the Lambda as the target:

aws events put-targets \
  --rule nfm-alarm-to-investigation \
  --targets '[{"Id":"lambda","Arn":"arn:aws:lambda:us-east-1:123456789012:function:devops-agent-webhook"}]'

Allow EventBridge to invoke the Lambda:

aws lambda add-permission \
  --function-name devops-agent-webhook \
  --statement-id eventbridge-invoke \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/nfm-alarm-to-investigation
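
To reason about why a rule does or does not fire, it can help to evaluate the pattern against a sample event locally. The sketch below is a deliberately simplified matcher, written for illustration: it handles only nested objects and exact-value lists, as used in the pattern above, and not EventBridge's other operators (prefix, numeric, anything-but) or array-valued event fields. `matches` is a hypothetical helper, not an AWS API.

```python
PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["NFM-Alarm-EKS-DB"],
        "state": {"value": ["ALARM"]},
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified EventBridge-style match: every pattern key must be present,
    and each leaf value must equal one of the values listed in the pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True
```

Matching is exact and case-sensitive, so an alarm named `nfm-alarm-eks-db` would not match the pattern above.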

Step 6: Test the Pipeline

Force the alarm to transition from OK to ALARM:

# Set to OK first
aws cloudwatch set-alarm-state \
  --alarm-name NFM-Alarm-EKS-DB \
  --state-value OK \
  --state-reason "Reset for testing"

# Wait a few seconds, then trigger
aws cloudwatch set-alarm-state \
  --alarm-name NFM-Alarm-EKS-DB \
  --state-value ALARM \
  --state-reason "Testing: NFM Timeouts exceeded threshold"

Within seconds, check:

  1. Lambda logs (/aws/lambda/devops-agent-webhook) - should show webhook response 200/202
  2. DevOps Agent console - a new investigation should appear automatically
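
For local testing of the handler, a synthetic event can stand in for the real alarm. The builder below is a sketch: `sample_alarm_event` is a hypothetical helper, and the event is a trimmed approximation of a CloudWatch Alarm State Change event that contains only the fields the Lambda function in Step 4 reads. The account ID, timestamp, and monitor ARN are placeholder values.

```python
def sample_alarm_event(alarm_name="NFM-Alarm-EKS-DB", state="ALARM",
                       region="us-east-1", account="123456789012"):
    """Build a minimal alarm state change event for local handler tests."""
    monitor_arn = (
        f"arn:aws:networkflowmonitor:{region}:{account}:monitor/my-monitor"
    )
    return {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "account": account,
        "region": region,
        "resources": [
            f"arn:aws:cloudwatch:{region}:{account}:alarm:{alarm_name}"
        ],
        "detail": {
            "alarmName": alarm_name,
            "state": {
                "value": state,
                "reason": "Testing: NFM Timeouts exceeded threshold",
                "timestamp": "2024-01-01T00:00:00.000+0000",
            },
            "configuration": {
                "metrics": [{
                    "id": "m1",
                    "metricStat": {
                        "metric": {
                            "namespace": "AWS/NetworkFlowMonitor",
                            "name": "Timeouts",
                            "dimensions": {"MonitorId": monitor_arn},
                        },
                        "period": 300,
                        "stat": "Sum",
                    },
                    "returnData": True,
                }],
            },
        },
    }
```

Calling `lambda_handler(sample_alarm_event(), None)` with `SECRET_ARN` set and valid AWS credentials sends a real webhook, so use it against a test Agent Space. With `state="OK"`, the handler returns early without contacting Secrets Manager or the webhook.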

Result - Autonomous Investigation Before a Human Even Joins

The key value is time. Traditionally, an alarm fires and pages an engineer, who opens a laptop and starts investigating from scratch; 40 to 70 minutes can pass before the root cause is found. With this automation, the DevOps Agent begins investigating within seconds of the alarm firing. By the time the on-call engineer opens their laptop, the investigation is already in progress or complete, ideally with a root cause identified and a mitigation plan ready.

AWS DevOps Agent correlates metrics, logs, deployment history, and network flow data automatically. It identifies issues stemming from system changes, resource limits, component failures, and dependency issues. Once root cause is identified, it provides detailed mitigation plans including actions to resolve, validate, and revert if needed. It also correlates related alarms to determine if they originate from the same event - reducing noise so teams focus on what matters.

The engineer's role shifts from "investigate from scratch" to "review findings and approve the fix." Time to root cause can drop from the better part of an hour to minutes.

What DevOps Agent Investigates

When triggered by the NFM alarm, DevOps Agent will:

  • Analyze CloudWatch metrics (timeouts, retransmissions, round-trip time)
  • Correlate with Network Firewall logs for blocked traffic
  • Review recent infrastructure changes (route tables, security groups, firewall rules)
  • Check deployment history for recent code or configuration changes
  • Identify root cause and provide a mitigation plan

Troubleshooting

Issue | Cause | Fix
No metrics in AWS/NetworkFlowMonitor | NFM agent not running or monitor too new | Verify agent pods are running; wait 15 min after creation
Alarm stays in INSUFFICIENT_DATA (or in OK and never fires, because the alarm above treats missing data as notBreaching) | Wrong dimension format | Use MonitorId with full ARN, not monitor name
Lambda not invoked | EventBridge rule not matching | Verify alarm name in event pattern matches exactly
Webhook returns 401/403 | HMAC signature mismatch | Verify secret key; ensure timestamp format is %Y-%m-%dT%H:%M:%S.000Z
Investigation not created | Webhook URL incorrect | Verify URL from DevOps Agent console; check Lambda logs for response body

Summary

By connecting Network Flow Monitor metrics to AWS DevOps Agent through CloudWatch Alarms and EventBridge, you create a fully automated incident response pipeline. Network degradation is detected within minutes, and an AI-powered investigation begins immediately, without waiting for a human to notice the problem.

AWS
EXPERT
published 14 days ago, 174 views