My AWS Glue job is generating too many logs in Amazon CloudWatch. I want to reduce the number of logs generated.
Short description
With AWS Glue Spark ETL jobs, you can't control the verbosity of the logs that are generated by the instances that your jobs run on. The logs are verbose so that you can use them to monitor internal failures and diagnose job failures. However, you can use the following methods to adjust the Spark logging levels:
- Choose the standard filter setting for continuous logging.
- Use the Spark context method setLogLevel.
- Use a custom log4j.properties file.
Resolution
Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.
Choose the standard filter setting for continuous logging
If you turned on continuous logging for your job, then choose the Standard filter for the Log filtering option. This filter prunes non-useful Apache Spark driver and executor messages and Apache Hadoop YARN heartbeat log messages. To change the log filter setting for your AWS Glue job, complete the following steps:
- Open the AWS Glue console.
- In the navigation pane, choose Jobs.
- Select the job that you want to update.
- Choose Action, and then choose Edit job.
- Expand the Monitoring options section.
- Select Continuous logging.
- Under Log filtering, select Standard filter.
- Choose Save.
To turn on continuous logging with the standard filter from the AWS CLI, set the following key-value pairs in the job's default arguments:
'--enable-continuous-cloudwatch-log': 'true'
'--enable-continuous-log-filter': 'true'
For more information, see Turn on continuous logging for AWS Glue jobs.
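If you manage your job programmatically, you can set the same parameters in the job's default arguments. The following is a minimal boto3 sketch, assuming a job named my-glue-job. UpdateJob replaces the whole job definition, so the sketch starts from the existing definition, and the list of copied fields might need adjusting for your job:

import boto3

glue = boto3.client("glue")

# Start from the existing definition, because UpdateJob resets any
# field that you leave out of JobUpdate.
job = glue.get_job(JobName="my-glue-job")["Job"]

# Merge the continuous logging parameters into the default arguments.
arguments = dict(job.get("DefaultArguments", {}))
arguments["--enable-continuous-cloudwatch-log"] = "true"
arguments["--enable-continuous-log-filter"] = "true"
job["DefaultArguments"] = arguments

# Copy only fields that the JobUpdate structure accepts.
update_fields = ("Role", "Command", "DefaultArguments", "Description",
                 "GlueVersion", "MaxRetries", "Timeout",
                 "NumberOfWorkers", "WorkerType", "ExecutionProperty")
glue.update_job(
    JobName="my-glue-job",
    JobUpdate={k: job[k] for k in update_fields if k in job},
)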
Important: Even with the standard filter setting, the application master logs for the Spark jobs are still pushed to the /aws-glue/jobs/output and /aws-glue/jobs/error log groups.
Use the Spark context method setLogLevel
You can set the logging level for your job with the setLogLevel method from pyspark.context.SparkContext. Valid logging levels include ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN. For more information, see the documentation for setLogLevel on the Spark website.
Use the following code to import the Spark context method and set the logging level for your job:
from pyspark.context import SparkContext

sc = SparkContext()
sc.setLogLevel("new-log-level")
Note: Replace new-log-level with the logging level that you want to set for your job. This code impacts the driver log behavior, but doesn't change the executor logs.
For more information, see Configuring Logging on the Spark website.
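For example, to keep only warnings and errors in the driver logs of a job script, you can set the level to WARN. The following is a minimal sketch, assuming a standard PySpark AWS Glue job:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Reuse the SparkContext that the Glue job provides, then raise the
# driver logging threshold so that INFO and DEBUG messages are dropped.
sc = SparkContext.getOrCreate()
sc.setLogLevel("WARN")

glueContext = GlueContext(sc)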
Use a custom log4j.properties file
AWS Glue 3.0 uses Log4j 1 for logging, and you can customize the logging behavior with a log4j.properties file. Starting with AWS Glue 4.0, AWS Glue ETL jobs use Log4j 2, and the logging behavior is configured with a log4j2.properties file.
Note: If you apply a custom log4j.properties or log4j2.properties configuration file, then the AWS Glue continuous logging feature is turned off.
For example, complete the following steps for AWS Glue 4.0:
Note: You can include your logging preferences in the log4j2.properties file. Then, you can upload the file to Amazon Simple Storage Service (Amazon S3), and use the file in the AWS Glue job.
- Create a file named log4j2.properties that sets the root logger level to ERROR.
Note: This is just an example use case. You must customize your log4j2.properties file to meet your logging needs. For more information about log4j2, see Configuration with Properties on the Apache Logging Services website.
rootLogger.level = error
rootLogger.appenderRef.stdout.ref = STDOUT
appender.console.type = Console
appender.console.name = STDOUT
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %p [%t] %c{2} (%F:%M(%L)): %m%n
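To keep the root logger more verbose but silence a single noisy package, you can add a named logger instead. The following lines are a sketch that lowers only Apache Spark's own messages to ERROR; adjust the package name to the logger that you want to quiet:

logger.spark.name = org.apache.spark
logger.spark.level = error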
- Upload the log4j2.properties file to Amazon S3, and then copy the file's S3 URI.
- In the AWS Glue job, add the following parameter (or pass it at run time, as shown in the sketch after these steps):
--extra-files, s3://[objectpath]/log4j2.properties
Note: Replace s3://[objectpath]/log4j2.properties with the S3 URI that you copied in the previous step.
- Save the job, and then run it. After the job completes, check the related log stream in the /aws-glue/jobs/error log group.
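Alternatively, you can pass the parameter for a single job run instead of saving it on the job definition. The following is a minimal boto3 sketch, assuming a job named my-glue-job; replace the placeholder S3 URI with your own:

import boto3

glue = boto3.client("glue")

# Supply the custom Log4j 2 configuration for this run only.
response = glue.start_job_run(
    JobName="my-glue-job",
    Arguments={"--extra-files": "s3://[objectpath]/log4j2.properties"},
)
print(response["JobRunId"])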
Related information
Monitoring with Amazon CloudWatch