How to auto-restart failed Greengrass v2 components when MQTT connection resumes after DNS failures?

Environment

We manage a fleet of approximately 130 AWS Greengrass v2 devices that occasionally experience DNS resolution failures with the MQTT client.

Observed Behavior

When DNS issues occur, we see repeated errors:

[ERROR] com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Failed to connect to AWS IoT Core. {clientId=MyDevice, error=A query to dns failed to resolve.}
[INFO] (AwsEventLoop 4) com.aws.greengrass.mqttclient.AwsIotMqtt5Client: Connection resumed. {clientId=MyDevice, sessionPresent=true} 2025-08-29T02:36:28.655Z 
[INFO] (AwsEventLoop 4) com.aws.greengrass.status.FleetStatusService: fss-status-update-published. Status update published to FSS. {trigger=RECONNECT, serviceName=FleetStatusService, currentState=RUNNING} 2025-08-29T02:36:28.732Z

While the connection sometimes self-recovers (after 30-60 seconds), we've observed cases where devices remain stuck in this state for extended periods, impacting our component operations.

Current Recovery Methods

Currently, we manually:

  • Restart the entire Greengrass service: systemctl restart greengrass
  • Or reboot the device completely

Questions

  1. Is there a way for components to detect MQTT connectivity issues and trigger their own restart?
  2. Can we implement a watchdog component that monitors for DNS/MQTT failures and restarts affected components programmatically?
  3. Are there Greengrass v2 APIs or lifecycle methods that components can use to:
    • Detect when the MQTT client is stuck
    • Force a reconnection attempt
    • Restart themselves or other components
  4. What's the recommended pattern for implementing self-healing components that can recover from core service connectivity issues?

Use Case

We need an automated solution that can detect these DNS/MQTT failures and take corrective action without manual intervention, especially for remote devices with limited access.

Any examples of component-level recovery mechanisms or watchdog implementations would be greatly appreciated.

asked 3 months ago, 82 views
2 Answers

Hi Phil.

"What's the recommended pattern for implementing self-healing components that can recover from core service connectivity issues?"

Can you please elaborate on why your components get in a state that requires them to be restarted? What are they trying to do?

DNS is just another reason that your device could go offline. Ideally your components would expect loss of connectivity, regardless of the cause, and should not enter an unrecoverable state.
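
As a rough illustration of "expecting loss of connectivity", here is a minimal Python sketch of retrying a network call with capped exponential backoff instead of crashing or hanging. Everything in it is an assumption for illustration: `operation` is a hypothetical stand-in for whatever network call your component makes, and the attempt count and delays are placeholder values.

```python
import random
import time


def call_with_backoff(operation, max_attempts=8, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Retry `operation` with capped exponential backoff and jitter.

    `operation` is a hypothetical zero-argument callable that raises
    OSError on failure (socket.gaierror for DNS errors and
    ConnectionError are both subclasses of OSError).
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps a fleet of devices from retrying in lockstep.
            sleep(random.uniform(0, delay))
```

A component built this way keeps running through an outage and resumes work once the call succeeds, so nothing needs to restart it.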

AWS
EXPERT
answered 3 months ago

In AWS IoT Greengrass v2, there are several approaches to handle component restarts when MQTT connectivity resumes after DNS failures:

Restarting Components

  1. Using the Greengrass CLI: You can create a watchdog component that uses the Greengrass CLI to restart specific components when issues are detected (this requires the aws.greengrass.Cli component to be deployed on the device):

     sudo /greengrass/v2/bin/greengrass-cli component restart --names "YourComponentName"

     This is particularly useful during development and testing.

  2. IPC Service for Component Management: For a programmatic approach, use the Greengrass CLI IPC service (available in Greengrass nucleus v2.6.0+) to restart components. Your watchdog component can call the RestartComponent operation, which returns a status of either SUCCEEDED or FAILED.

  3. Configuration-based Restart Triggers: Some components support configuration parameters specifically for triggering restarts. For example, the EMQX MQTT broker component has a restartIdentifier configuration option that, when changed, forces the broker to restart.
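
For the CLI approach, a watchdog written in Python could wrap the restart command along these lines. This is only a sketch under assumptions: the CLI path is the one from the command above, the process is assumed to already have permission to run the CLI (for example because the watchdog runs as root), and the `run` parameter exists only so the function can be exercised without a real device.

```python
import subprocess

# Path taken from the command above; adjust if Greengrass is installed elsewhere.
GREENGRASS_CLI = "/greengrass/v2/bin/greengrass-cli"


def restart_component(name, run=subprocess.run):
    """Restart one component through the Greengrass CLI.

    Returns True when the CLI exits with status 0. Assumes the calling
    process is already allowed to run the CLI (otherwise prefix with sudo).
    """
    cmd = [GREENGRASS_CLI, "component", "restart", "--names", name]
    result = run(cmd, capture_output=True, text=True, timeout=60)
    return result.returncode == 0
```

Calling `restart_component("YourComponentName")` then mirrors the manual command, and the boolean result lets the watchdog log or escalate when the restart itself fails.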

Self-healing Component Patterns

To implement a self-healing watchdog component:

  1. Create a monitoring component that:
    • Periodically checks MQTT connectivity status
    • Logs DNS resolution attempts
    • Detects when connections are stuck or failing
    • Takes corrective action by restarting affected components
  2. Use lifecycle configuration:
    • The install lifecycle suits one-time setup operations; bootstrap is intended for steps that require restarting the nucleus or the device
    • The run lifecycle can implement your monitoring logic
    • Changing an environment variable in the Setenv configuration through a new deployment can be used to force a component restart when needed
  3. Force component restarts through a deployment by:
    • Updating component versions (even with no other changes)
    • Resetting component configurations to default settings
    • Changing environment variables in the component's configuration
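
The "detects when connections are stuck" step could be sketched as follows. This assumes the watchdog reads nucleus log lines and matches the two messages quoted in the question; the class name and the threshold are hypothetical, and the threshold should be set high enough that the usual 30-60 second self-recovery does not trigger it.

```python
# Message fragments copied from the log excerpt in the question.
FAILED = "Failed to connect to AWS IoT Core"
RESUMED = "Connection resumed"


class StuckDetector:
    """Counts consecutive MQTT connect failures seen in nucleus log lines."""

    def __init__(self, threshold=10):
        self.threshold = threshold  # hypothetical value; tune per fleet
        self.failures = 0

    def feed(self, line):
        """Process one log line; return True while the client looks stuck."""
        if RESUMED in line:
            self.failures = 0  # a successful reconnect clears the count
        elif FAILED in line:
            self.failures += 1
        return self.failures >= self.threshold
```

A watchdog would feed it each new line from the nucleus log (commonly /greengrass/v2/logs/greengrass.log on a default install, which is an assumption about your setup) and take corrective action once `feed` returns True.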

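Note that Greengrass v2 handles unexpected termination differently from Greengrass v1. A component whose lifecycle script exits with an error moves to the ERRORED state and the nucleus retries it; after repeated errors it is marked BROKEN and the nucleus stops retrying. A run script that exits cleanly leaves the component in the FINISHED state, and it is not started again. For these cases you may need a custom solution where a monitoring component detects the state and triggers a restart through the CLI or IPC service.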
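
A monitoring component could do the state check and restart over IPC roughly like this. Treat it as a sketch under assumptions rather than a reference implementation: it assumes the AWS IoT Device SDK v2 for Python is installed, that its `GreengrassCoreIPCClientV2` exposes `get_component_details` and `restart_component` as used here, and that the component's recipe authorizes those operations. Whether a restart is enough to clear a BROKEN component is not confirmed here.

```python
def restart_if_unhealthy(component_name, client=None,
                         unhealthy=("ERRORED", "BROKEN")):
    """Restart `component_name` over IPC if its state is in `unhealthy`.

    Returns the restart status string, or None if no restart was attempted.
    """
    if client is None:
        # Imported lazily so the function can be tried with a stub client.
        # Assumes the AWS IoT Device SDK v2 for Python is available.
        from awsiot.greengrasscoreipc.clientv2 import GreengrassCoreIPCClientV2
        client = GreengrassCoreIPCClientV2()
    details = client.get_component_details(component_name=component_name)
    state = str(details.component_details.state)
    if state not in unhealthy:
        return None
    response = client.restart_component(component_name=component_name)
    return str(response.restart_status)  # expected: SUCCEEDED or FAILED
```

Running this on a timer from the watchdog's run lifecycle keeps the check independent of the components being watched. Pass a different `unhealthy` tuple if you also want to restart components that ended in FINISHED.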

Combined, these patterns let a device recover from DNS and MQTT connectivity issues without manual intervention.
Sources
How do I redeploy the same deployment config in Greengrass v2 | AWS re:Post
Manage local deployments and components - AWS IoT Greengrass
IOT Greengrass component executing on every restart. | AWS re:Post
MQTT 5 broker (EMQX) - AWS IoT Greengrass

answered 3 months ago
