Elastic Beanstalk worker (PHP 8.1 on AL2): scheduled cron.yaml jobs stop running at the start of the month


Our stack consists of an EB WebApp and EB Worker environment running SQS scheduled jobs as documented here: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html

The web app and worker run from the same codebase, deployed by a CodePipeline. We have some network configuration issues on the worker instance that result in gateway timeouts, but the scheduled jobs are still dispatched successfully because SQS is configured to ignore these errors.

It's a hack, but it works as expected until the start of the month (UTC). For the past three months, as soon as we hit midnight UTC on the first of the month, scheduling breaks completely: the jobs don't run at all, and there's no evidence in our logging systems that SQS has even made the requests. We can still execute the jobs manually by SSHing into the EC2 instance and making the requests by hand.

The only way to resolve the issue is to completely rebuild the environment.

Unfortunately I wasn't able to export the logs before rebuilding the environment. Here are the latest SQS daemon logs, which don't seem to show anything out of the ordinary:

2023-08-02T09:27:15Z init: initializing aws-sqsd 3.0.4 (2022-03-18)
2023-08-02T09:27:15Z schedule-parser: Successfully loaded 7 scheduled tasks from file /var/app/current/cron.yaml .
2023-08-02T09:27:15Z leader-election: initialized leader election
2023-08-02T09:27:15Z scheduler: initialized 7 job's pending time
2023-08-02T09:27:15Z pollers: start initializting poller timer...
2023-08-02T09:27:15Z pollers: start auto running poller...
2023-08-02T09:27:15Z leader-election: Starting leader election
2023-08-02T09:27:15Z leader-election: current role: worker
2023-08-02T09:27:15Z leader-election: leader election record attribute not found, inserting
2023-08-02T09:27:15Z scheduler: Starting scheduler
2023-08-02T09:27:15Z start: polling https://sqs.eu-west-2.amazonaws.com/....
2023-08-02T09:27:16Z leader-election: leader expired at 1969-12-31 23:59:59 +0000, electing new leader ...
2023-08-02T09:27:16Z leader-election: we're now the leader
2023-08-02T09:30:00Z message: sent to http://localhost:80/restricted/cron/queueScheduler.php
2023-08-02T09:30:00Z message: sent to http://localhost:80/restricted/cron/onlineDevices.php

Any clues?

1 Answer
  1. AWS Elastic Beanstalk Worker configuration: Verify the cron.yaml file in your worker environment, making sure that the cron jobs are defined correctly, especially regarding the frequency. If there are issues with the way the cron jobs are scheduled, it might cause the jobs to stop working.

  2. SQS Message Retention: Check the message retention setting on the SQS queue that your worker environment is polling. By default, SQS retains messages for 4 days, but you can set the retention period to any value from 60 seconds to 1,209,600 seconds (14 days). If messages are not processed within the configured time, they are deleted from the queue and are no longer available for processing.

  3. Instance Timezone: Check the timezone of your EC2 instances. If they are not set to UTC, and your cron jobs are scheduled based on UTC, this could potentially cause issues. You can check the timezone with date +%Z and set it to UTC with sudo timedatectl set-timezone UTC.

  4. Network Configuration: You mentioned having network configuration issues that result in gateway timeout errors. While these don't seem to prevent your jobs from running normally, it's still worth looking into. Network issues can manifest in unexpected ways and might be part of the issue.

  5. Logs and Monitoring: Since you weren't able to export the logs before rebuilding the environment, it would be beneficial to implement a logging solution that automatically exports logs to an external system like CloudWatch or S3. This way, you can analyze them in case the issue happens again. Additionally, consider implementing more detailed monitoring on both the worker instances and the SQS queue to help pinpoint exactly when and why the jobs stop running.
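For point 1, it can help to diff your file against the documented cron.yaml shape. A minimal sketch (the task name, URL, and schedule below are hypothetical, not taken from your environment):

```yaml
version: 1
cron:
  - name: "queueScheduler"                       # must be unique per task
    url: "/restricted/cron/queueScheduler.php"   # POSTed to localhost by aws-sqsd
    schedule: "*/10 * * * *"                     # standard 5-field cron, evaluated in UTC
```

Since aws-sqsd evaluates these schedules in UTC regardless of instance timezone, any field that only makes sense in local time is worth double-checking.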
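For point 5, log streaming can be enabled from the application bundle itself so that logs survive an environment rebuild. A hedged sketch using an .ebextensions config (the file name is arbitrary; a short retention period keeps CloudWatch costs down):

```yaml
# .ebextensions/cloudwatch-logs.config
option_settings:
  aws:elasticbeanstalk:cloudwatch:logs:
    StreamLogs: true          # stream instance logs (including aws-sqsd) to CloudWatch Logs
    DeleteOnTerminate: false  # keep log groups after the environment is terminated/rebuilt
    RetentionInDays: 7        # short retention to limit cost
```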

answered 9 months ago
  • I've verified that the EC2 timezone is set to UTC. The cron.yaml is OK, and sqsd and our developer tools parse it correctly. Our message retention is set to 14 days (I configured this in the CloudFormation template), but I'm not entirely sure how that would affect job scheduling. However, we have had issues in the past with the worker environment failing because sqsd segfaulted or simply failed to launch on some versions of AL2. This was resolved by downgrading, and it hasn't been an issue on newer versions of AL2. I'll look at CloudWatch, but we are heavily cost-constrained at the moment.

  • Another thing I've noticed is that the scheduled jobs sometimes run a few minutes late. A job expected to run at 17:00:00 UTC might run at 17:02:01 instead (give or take). Our current way of running scheduled jobs is highly unreliable, and this is something we want to change going forward.
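If moving away from cron.yaml is an option, one alternative is to have Amazon EventBridge deliver scheduled messages directly to the worker environment's SQS queue, taking the in-instance scheduler and leader election out of the picture. A hedged CloudFormation sketch (resource names are hypothetical, and a queue policy allowing events.amazonaws.com to send messages is also required):

```yaml
ScheduledJobRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 17 * * ? *)"  # EventBridge cron is 6-field and always UTC
    State: ENABLED
    Targets:
      - Arn: !GetAtt WorkerQueue.Arn          # hypothetical worker environment queue
        Id: "WorkerQueueTarget"
```

The worker daemon would then POST the rule's event payload to the configured worker URL, so the application side may need to tolerate the different message body.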
