Can Greengrass spawn too many Component processes?


Here is my issue.

I recently updated a component (Python) that is deployed to my Greengrass core devices. One of the new features of the component is logging errors to Datadog.

I created a deployment and published it. It went rather well, with over 16,000 devices updating and only 22 failures. Many of the failures could be solved by forcing a re-installation of Greengrass.

But 25-30 of the successful deployments led to extremely weird behaviour: some of the devices logged hundreds of thousands of entries in Datadog. Given the way the logging works, that would seem impossible. Here is the relevant code:

    async def _process_logs(self):
        current_log = None
        is_from_retry = False
        try:
            while True:
                if current_log:
                    await asyncio.sleep(self.retry_interval)
                    is_from_retry = True
                else:
                    current_log = await self._msg_queue.get()
                    is_from_retry = False

                if await self._send_log_to_datadog(current_log, is_from_retry):
                    current_log = None

        except asyncio.CancelledError:
            print("Datadog logger has been cancelled")
        except Exception as e:
            print(f"Datadog: Something terrible happened. {e}")

The upstream logger simply puts log messages on the queue (if there is space). Downstream, self._send_log_to_datadog uses aiohttp to try to send each log to Datadog once and only once (no loop), returning "True" if it succeeded and "False" otherwise.
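For context, the upstream side described above could look roughly like the sketch below. This is only an illustration, not the actual component code: the class name `QueueingLogger`, the `max_pending` parameter and the `dropped` counter are hypothetical, and only `_msg_queue` comes from the snippet above.

```python
import asyncio


class QueueingLogger:
    """Hypothetical upstream side: enqueue without blocking, drop when full."""

    def __init__(self, max_pending: int = 100):
        # Bounded queue, so a stalled sender cannot grow memory without limit.
        self._msg_queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)
        self.dropped = 0  # number of messages discarded because the queue was full

    def log(self, message: str) -> bool:
        """Queue a message if there is space; return False if it was dropped."""
        try:
            self._msg_queue.put_nowait(message)
            return True
        except asyncio.QueueFull:
            self.dropped += 1
            return False
```

With a bounded queue like this, a single consumer loop can only ever send what was actually queued, which is why the volume seen in Datadog looked impossible for one instance.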

Now, in one instance a device logged 265K log entries with "is_from_retry" false and 3.65K log entries with "is_from_retry" true. This seems to indicate that the logging is coming from the device itself, and AFAICT it could only be explained by Greengrass spawning numerous instances of the same component (the log we see would occur during initialisation).

I'm using the latest version (Aug 30 2024) of Nucleus and ShadowManager.

Has anyone experienced something similar?

  • Could you provide the greengrass.log and a thread dump of the Greengrass Nucleus process to help us investigate the issue?

  • Unfortunately I cannot. Once they get into that logging frenzy they stop responding, so I cannot access them. The machines are on the other side of the world, so I cannot get there.

  • Could you cut us a support ticket and provide the account ID, please? If possible, please also share the logs from Datadog and we'll investigate further.

frawau
asked a month ago, 66 views
1 Answer
Accepted Answer

Thanks, but I found out what it was. My bad. In short, some devices had a policy that had somehow not been updated. This triggered a lot of connection authorisation failures, which in turn restarted some tasks. One task was not properly cancelled, and that was creating multiple concurrent tasks.
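For anyone hitting the same thing, here is a minimal sketch of the pattern that avoids it: cancel the previous task and wait for it to finish before starting a replacement. The names (`WorkerSupervisor`, `restart`, `stop`, `running`) are made up for illustration and this is not the actual component code, just the general asyncio idea.

```python
import asyncio
from typing import Optional


class WorkerSupervisor:
    """Hypothetical helper that keeps at most one worker task alive."""

    def __init__(self):
        self._task: Optional[asyncio.Task] = None
        self.running = 0  # number of worker loops currently alive

    async def _worker(self):
        self.running += 1
        try:
            while True:
                await asyncio.sleep(0.01)  # stand-in for the real work
        finally:
            # Runs on cancellation too, so the counter stays accurate.
            self.running -= 1

    async def restart(self):
        # Stop the old task first; skipping this step leaks one task per restart.
        await self.stop()
        self._task = asyncio.create_task(self._worker())

    async def stop(self):
        task, self._task = self._task, None
        if task is None:
            return
        task.cancel()
        try:
            await task  # wait until the cancellation has actually been processed
        except asyncio.CancelledError:
            pass


async def demo() -> int:
    sup = WorkerSupervisor()
    for _ in range(5):  # simulate repeated restarts after connection failures
        await sup.restart()
        await asyncio.sleep(0)  # give the new worker a chance to start
    count = sup.running
    await sup.stop()
    return count
```

Running `asyncio.run(demo())` leaves a single worker alive after five restarts. Without the `await self.stop()` call in `restart`, every restart would add another loop pulling from the same queue.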

frawau
answered a month ago
