Timeout setting on long-running Lambda


I need some clarification on the meaning of the timeout setting for long-running Lambdas on Greengrass.

I have reduced my function to the simplest meaningful version:

from threading import Timer
import logging

logger = logging.getLogger()

Invocation = 0

def ggLongRunnerFunction():
  global Invocation

  Invocation += 1
  Timer(30, ggLongRunnerFunction).start()  # schedule next call
  logger.info("ggLongRunnerFunction invocation {}".format(Invocation))
  return

ggLongRunnerFunction()  # first start + schedule next call, just like the hello world example
#
# Dummy handler, called only if hit by a subscription
#
def function_handler(event, context):
  logger.warning("function_handler called")
  return

I can see that the main body gets executed once on deployment and that ggLongRunnerFunction then gets executed regularly.
What is the meaning of the timeout value set under Lambdas in the group settings?
Which time is measured? That of ggLongRunnerFunction? The whole file? function_handler?
I am asking because I sometimes see "runtime terminated" messages in the logs that do not correspond to the timeout values. After that there are messages stating that GG cannot stop its workers properly, and all Lambda functions are then completely dead until the next deployment, i.e. even a second Lambda won't run when triggered by MQTT.
My suspicion is that the timeout kills ggLongRunnerFunction mid-execution, which produces the missing task GG complains about and then stops Lambda execution altogether.

I have now tried this with ggLongRunner defined as the handler function (and also without), with a timeout setting of 1 second, and it seems no timeout occurs at all, even though the function waits for 20 seconds:
Is that correct for long-lived functions?

import logging
from time import sleep

import greengrasssdk

logger = logging.getLogger()
client = greengrasssdk.client('iot-data')

Invocation = 0

def ggLongRunner(event, context):
  global Invocation

  Invocation = Invocation + 1

  logger.info("ggLongRunner invocation[{}] Entry".format(Invocation))
  client.update_thing_shadow(thingName="Homegear01",
                             payload='{"state":{"reported":{"Invocation":' + str(Invocation) + '}}}')

  sleep(20)

  logger.info("ggLongRunner invocation[{}] Exit".format(Invocation))
  return

while True:
  ggLongRunner(event=None, context=None)

Edited by: TJirsch on Jan 17, 2019 3:16 PM

TJirsch
asked 6 years ago · 648 views
5 Answers

Hi TJirsch,

The timeout on a 'pinned' function defines the amount of time the Core allows for each call to the handler. If this is exceeded, the invocation is terminated.

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/greengrass.html

Pinned (boolean) -- True if the function is pinned. Pinned means the function is long-lived and starts when the core starts.
Timeout (integer) -- The allowed function execution time, after which Lambda should terminate the function. This timeout still applies to pinned Lambda functions for each request.
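
For reference, both settings live in the FunctionConfiguration of the group's function definition version. A minimal sketch of setting them with boto3 (the ARN, IDs and values below are placeholders, not taken from your setup):

import boto3

# Sketch only: creates a function definition whose single function is pinned
# (long-lived) with a 30-second per-request timeout.
gg = boto3.client('greengrass')

response = gg.create_function_definition(
    Name='ggLongRunnerDefinition',  # placeholder name
    InitialVersion={
        'Functions': [
            {
                'Id': 'ggLongRunner',  # arbitrary ID within this definition
                'FunctionArn': 'arn:aws:lambda:eu-central-1:123456789012:function:ggLongRunner:1',
                'FunctionConfiguration': {
                    'Pinned': True,    # long-lived, started when the core starts
                    'Timeout': 30,     # seconds allowed per request
                    'MemorySize': 16384  # KB
                }
            }
        ]
    }
)
print(response['LatestVersionArn'])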

I have experienced adverse effects from blocking the 'main' thread (which shouldn't be a problem with your first example, only your second), as this prevents the Core from invoking the available handler function upon a subscription. The method I have developed for this (if I need extra things running in the background) is as follows:

import threading
import time

def background_worker():
    # Does something in a while loop with waits, for example:
    # toggle an LED, wait 5 seconds, and repeat
    while True:
        # do_background_work()
        time.sleep(5)

def lambda_handler(event, context):
    # do_event_logic
    return

# Start extra threads when the Lambda is initially launched
t1 = threading.Thread(target=background_worker)
t1.start()

This keeps the main thread open for any incoming requests, and should solve any weird timeout issues you may be experiencing due to this blockage.

It is also worth noting that all runtimes are terminated during a deployment, so this may factor into the randomness in your logs.

Hope this helps,
Marty

answered 6 years ago

Hi Marty,

thank you for your help. I used the loop variant to eliminate any possible influence from threading.
I have now tried the following, with ggLongRunner defined as the handler function:

import logging
from threading import Timer
from time import sleep

import greengrasssdk

logger = logging.getLogger()
client = greengrasssdk.client('iot-data')

Invocation = 0
logger.info("ggLongRunnerFile Initialization")

def ggLongRunner(event, context):
  global Invocation

  Invocation = Invocation + 1

  if event is not None:
    logger.info("ggLongRunner invocation by event[{}] Entry".format(event))
  else:
    logger.info("ggLongRunner invocation by Timer [{}] Entry".format(Invocation))
    Timer(60, ggLongRunner, (None, None)).start()  # schedule the next timer-driven call

  client.update_thing_shadow(thingName="Homegear01",
                             payload='{"state":{"reported":{"Invocation":' + str(Invocation) + '}}}')

  sleep(20)

  logger.info("ggLongRunner invocation[{}] Exit".format(Invocation))
  return

Timer(10, ggLongRunner, (None, None)).start()  # first timer-driven call, 10 seconds after deployment

The result: the timer-triggered ggLongRunner calls are not terminated by the timeout set in the Lambda configuration, while ggLongRunner calls triggered by sending an MQTT message are terminated by it.

So the clarification is: the timeout applies only to calls to the handler function defined in the Lambda configuration, using the timeout value defined in the group's Lambda config.
All other functionality defined in the Python file(s) containing the handler is not touched.
Even calling the handler programmatically does not activate the timeout, so there is no special decoration of, or code change to, the handler function.
Good to know.

I will now try to find out what happens if my function runs from the timer invocation while parallel requests come in from MQTT.
Is that situation thread-safe, or even possible?

What I really need is a reliable, time-scheduled mechanism running on a GG instance for querying some external devices that cannot send via MQTT and relaying the results to the IoT core.
I do not want to schedule Python scripts via cron, because that would mean doing setup, monitoring etc. on each device.

Thomas

Edited by: TJirsch on Jan 18, 2019 4:42 AM

TJirsch
answered 6 years ago

I believe all events will be processed sequentially in a FIFO manner when using a pinned Lambda, so it should be thread safe.

The timeouts are controlled by the environment invoker, so any internal calls within your code will not be affected by them, though I recommend having a handler solely for message processing and a separate thread for your background tasks.

If you are simply recording state, there should be very little use for your Lambda Handler, so you should be able to do something like the following:

import threading
import json

import greengrasssdk

# Global variables
client = greengrasssdk.client('iot-data')
poller_count = 0

# Handler which is only called when an incoming IOT message
# causes it to trigger
def lambda_handler(event, context):
    print("Event received: {}".format(event))
    return

# Poller which runs every (roughly) 60 seconds by
# scheduling itself.
def gg_poller():
    global poller_count

    # Increment the counter/do work here
    poller_count = poller_count + 1

    # Payload
    payload = {
        'state': {
            'reported': {
                'poller_count': poller_count
            }
        }
    }

    # Update the shadow
    client.update_thing_shadow(thingName="Homegear01",
                               payload=json.dumps(payload))

    # Schedule the next run in 60 seconds
    threading.Timer(60, gg_poller).start()
    return

# When initially started, do a gg_poller, which will then schedule itself
gg_poller()
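
If the handler and the poller ever share state (like the Invocation counter in your earlier snippets), it is safest not to rely on FIFO ordering alone. A minimal sketch of guarding a shared counter with a lock (the names here are only illustrative, not from the code above):

import threading

counter_lock = threading.Lock()
shared_counter = 0

def increment_counter():
    # Both the handler thread and the poller thread can call this safely
    global shared_counter
    with counter_lock:
        shared_counter += 1
        return shared_counter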

Hope that way of calling things helps,
Marty

Edited by: martysweet on Jan 19, 2019 10:20 AM

answered 6 years ago

That is exactly what I am doing right now. After some invocations the whole GG shadow worker goes down, but I cannot figure out why. I inserted counters to track function calls and to make sure no call gets terminated or fails to return. All I can see is that the messages from the publish sometimes get processed before the poller exits, but that should be OK.
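
For reference, the counters are just simple entry/exit bookkeeping around the calls, roughly along these lines (a simplified sketch, not the exact code I use):

import functools
import logging

logger = logging.getLogger()
entered = 0
returned = 0

def track_calls(func):
    # Log entry/exit counts so calls that never return show up as entered > returned
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global entered, returned
        entered += 1
        logger.info("%s entered (%d entered / %d returned)", func.__name__, entered, returned)
        result = func(*args, **kwargs)
        returned += 1
        logger.info("%s returned (%d entered / %d returned)", func.__name__, entered, returned)
        return result
    return wrapper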

What I get after some invocations are messages like this:

[2019-01-19T17:03:34.912+01:00][INFO]-Get work item: 083779e3-8d17-45b0-7efc-3fe3314db248

==> /greengrass/ggc/var/log/system/runtime.log <==
[2019-01-19T17:03:38.917+01:00][WARN]-connection was closed by client: EOF
[2019-01-19T17:03:38.996+01:00][INFO]-Function arn:aws:lambda:eu-central-1:913839406633:function:ggLongRunner:79 Worker ad6a606f-1e74-4fae-6acb-b10449481f87 Max Memory Used 11832 KB
[2019-01-19T17:03:39.196+01:00][INFO]-Function arn:aws:lambda:::function:GGShadowSyncManager Worker 4052ee0c-170c-4cd3-65d4-3b9ddb125cfc Max Memory Used 14604 KB
[2019-01-19T17:03:40.967+01:00][INFO]-Tearing down worker process factory
[2019-01-19T17:03:40.996+01:00][INFO]-Function arn:aws:lambda:::function:GGDeviceCertificateManager Worker 9db4bbf4-8a11-4c61-5e7f-1e098cec2618 Max Memory Used 11332 KB
[2019-01-19T17:03:41.196+01:00][INFO]-Function arn:aws:lambda:::function:GGSecretManager:1 Worker 02392278-ff32-4215-49bd-d0697ae7b1fd Max Memory Used 10296 KB
[2019-01-19T17:03:41.396+01:00][INFO]-Function arn:aws:lambda:::function:GGCloudSpooler:1 Worker 3041cd1a-bb9c-4351-6047-ebbb78e189b6 Max Memory Used 10220 KB
[2019-01-19T17:03:41.596+01:00][INFO]-Function arn:aws:lambda:::function:GGTES Worker 061b4c03-6254-4ca6-7cf5-f3e1b7271fa8 Max Memory Used 10204 KB
[2019-01-19T17:03:41.796+01:00][INFO]-Function arn:aws:lambda:::function:GGConnManager Worker 1da27a21-0738-44d6-7d81-f68341d7a0d2 Max Memory Used 10620 KB
[2019-01-19T17:03:42.271+01:00][INFO]-message received from (arn:aws:lambda:::function:GGShadowService) with subject ($aws/things/Homegear01/shadow/update/accepted)
[2019-01-19T17:03:42.272+01:00][INFO]-message received from (arn:aws:lambda:::function:GGShadowService) with subject ($aws/things/Homegear01/shadow/update/documents)
[2019-01-19T17:03:42.273+01:00][WARN]-failed to put work result. Unknown invocation ID [083779e3-8d17-45b0-7efc-3fe3314db248]
[2019-01-19T17:03:43.68+01:00][WARN]-connection was closed by client: EOF
[2019-01-19T17:03:43.68+01:00][WARN]-connection was closed by client: EOF
[2019-01-19T17:03:43.796+01:00][INFO]-Function arn:aws:lambda:::function:GGShadowService Worker 4e635d12-5910-4126-4867-24828f9efe39 Max Memory Used 13296 KB
[2019-01-19T17:03:43.796+01:00][ERROR]-worker ungracefully killed arn:aws:lambda:::function:GGShadowService 4e635d12-5910-4126-4867-24828f9efe39 exit status 1
[2019-01-19T17:03:43.796+01:00][ERROR]-Failed to restart system worker. GGC will stop all workers. Please check logs to debug. unable to deregister [4e635d12-5910-4126-4867-24828f9efe39]: unable to deregister non-existent process 4e635d12-5910-4126-4867-24828f9efe39

==> /greengrass/ggc/var/log/system/GGShadowService.log <==
[2019-01-19T17:03:42.278+01:00][ERROR]-failed to get work item: [arn:aws:lambda:::function:GGShadowService]: response status code: 400
[2019-01-19T17:03:42.293+01:00][ERROR]-failed to get work item: [arn:aws:lambda:::function:GGShadowService]: response status code: 400
[2019-01-19T17:03:42.479+01:00][ERROR]-failed to get work item: [arn:aws:lambda:::function:GGShadowService]: response status code: 400
[2019-01-19T17:03:42.519+01:00][ERROR]-failed to get work item: [arn:aws:lambda:::function:GGShadowService]: response status code: 400
[2019-01-19T17:03:42.844+01:00][ERROR]-failed to get work item: [arn:aws:lambda:::function:GGShadowService]: response status code: 400
[2019-01-19T17:03:43.156+01:00][ERROR]-failed to get work item: [arn:aws:lambda:::function:GGShadowService]: response status code: 400
[2019-01-19T17:03:43.293+01:00][ERROR]-failed to get work item: [arn:aws:lambda:::function:GGShadowService]: response status code: 400
[2019-01-19T17:03:43.584+01:00][ERROR]-failed to get work item: [arn:aws:lambda:::function:GGShadowService]: response status code: 400
[2019-01-19T17:03:43.674+01:00][ERROR]-failed to get work item: [arn:aws:lambda:::function:GGShadowService]: response status code: 400
[2019-01-19T17:03:43.675+01:00][FATAL]-the requested task has timed out, taskName: arn:aws:lambda:::function:GGShadowService, haveTried: 10 out of 10, totalWaitTime: 1.37s, multiplier: 50, backoffCoefficient: 2, jitterEnabled: true, retryErrors: [[arn:aws:lambda:::function:GGShadowService]: response status code: 400 [arn:aws:lambda:::function:GGShadowService]: response status code: 400 [arn:aws:lambda:::function:GGShadowService]: response status code: 400 [arn:aws:lambda:::function:GGShadowService]: response status code: 400 [arn:aws:lambda:::function:GGShadowService]: response status code: 400 [arn:aws:lambda:::function:GGShadowService]: response status code: 400 [arn:aws:lambda:::function:GGShadowService]: response status code: 400 [arn:aws:lambda:::function:GGShadowService]: response status code: 400 [arn:aws:lambda:::function:GGShadowService]: response status code: 400 [arn:aws:lambda:::function:GGShadowService]: response status code: 400]

TJirsch
answered 6 years ago

I was sending 12 shadow updates from the Lambda function for each device, to sub-items of one shadow device in the IoT cloud.
Now I have put that into a single update_thing_shadow call, sending the same data.
No ShadowService hiccups so far, maybe that was it.
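
For illustration, combining the per-device values into one reported state looks roughly like this (the device names and fields are invented, and client is the greengrasssdk iot-data client from the snippets above):

import json

# One shadow update carrying all readings instead of 12 separate calls
payload = {
    'state': {
        'reported': {
            'device01': {'temperature': 21.5},
            'device02': {'temperature': 19.8},
            # ... remaining devices ...
        }
    }
}

client.update_thing_shadow(thingName="Homegear01",
                           payload=json.dumps(payload))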

Update: the same thing happens after moving the updates to a different thread. The shadow service still terminates.
I will have to do it from a local Python script.
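
If anyone goes the same route, a minimal sketch of pushing the shadow update from a standalone script straight to AWS IoT with boto3 (assumes local AWS credentials; the values are placeholders and this is not Greengrass-specific code):

import json
import boto3

# Talks to the cloud shadow directly instead of going through the Greengrass core
iot = boto3.client('iot-data', region_name='eu-central-1')

payload = {'state': {'reported': {'Invocation': 1}}}

iot.update_thing_shadow(thingName='Homegear01',
                        payload=json.dumps(payload))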

Edited by: TJirsch on Jan 22, 2019 12:16 AM

TJirsch
answered 6 years ago
