Here is my issue.
I recently updated a component (Python) deployed to my Greengrass core devices. One of the new features of the component is logging errors to Datadog.
I created a deployment and published it. It went rather well, with over 16,000 devices updating and only 22 failures. Many of the failures could be solved by forcing a re-installation of Greengrass.
However, 25-30 of the successful deployments led to extremely weird behaviour, with some of the devices logging hundreds of thousands of entries in Datadog. Given the way the logging works, that would seem impossible. Here is the relevant code:
async def _process_logs(self):
    current_log = None
    is_from_retry = False
    try:
        while True:
            if current_log:
                await asyncio.sleep(self.retry_interval)
                is_from_retry = True
            else:
                current_log = await self._msg_queue.get()
                is_from_retry = False
            if await self._send_log_to_datadog(current_log, is_from_retry):
                current_log = None
    except asyncio.CancelledError:
        print("Datadog logger has been cancelled")
    except Exception as e:
        print(f"Datadog: Something terrible happened. {e}")
The upstream logger simply puts log messages on the queue (if there is space), and downstream, self._send_log_to_datadog uses aiohttp to try to send each log to Datadog exactly once (no loop), returning True if it succeeded and False otherwise.
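For context, here is a minimal sketch of that upstream side: a bounded asyncio.Queue whose producer drops messages when the queue is full. The class and method names (DatadogLogger, log_message) are hypothetical, not the actual component's API:

```python
import asyncio


class DatadogLogger:
    """Sketch of the producer side (names are hypothetical)."""

    def __init__(self, maxsize: int = 100):
        # Bounded queue: at most `maxsize` pending log entries at once
        self._msg_queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)

    def log_message(self, msg: str) -> bool:
        """Enqueue a log entry if there is space; drop it otherwise."""
        try:
            self._msg_queue.put_nowait(msg)
            return True
        except asyncio.QueueFull:
            # Queue is full: drop the message rather than block the caller
            return False
```

With a bounded queue and a single consumer task, the send rate is capped by the retry interval and the HTTP round-trip, which is why a flood of non-retry entries points at multiple component instances rather than this loop.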
Now, in one instance a device logged 265K entries with is_from_retry false and 3.65K entries with is_from_retry true. This seems to indicate that the logging is coming from the device itself and, AFAICT, could only be explained by Greengrass spawning numerous instances of the same component (the logs we see would occur during initialisation).
I'm using the latest version (Aug 30, 2024) of Nucleus and ShadowManager.
Has anyone experienced something similar?
Could you provide the greengrass.log and a thread dump of the Greengrass Nucleus process to help us investigate the issue?
Unfortunately, I cannot. Once they get into that logging frenzy they stop responding, so I cannot access them. The machines are on the other side of the world, so I cannot get there.
Could you cut us a support ticket and provide the account ID, please? If possible, please also share the Datadog logs and we'll investigate further.