I want to troubleshoot an Amazon EMR Serverless job that doesn’t start, runs slowly, or is stuck in the PENDING state.
Short description
A cold start occurs when a job must wait for workers to be provisioned because the application is idle or is scaling up from a low baseline.
If you submit a job and workers from initialCapacity are available, then the job runs on those pre-initialized resources. If another job is already using the initialCapacity workers, then the Amazon EMR Serverless application requests additional workers, up to the application's maximum capacity limit.
Resolution
To troubleshoot an Amazon EMR Serverless job that doesn't start, runs slowly, or is stuck in the PENDING state, take the following actions:
- To keep your drivers and workers ready to quickly respond and immediately start your application, use pre-initialized capacity.
- Set up an appropriate initialCapacity for Hive and Spark.
- Configure different sizes for drivers and executors.
- To scale up your job, specify the maximum capacity for your CPU, memory, and disk.
- To avoid idle resources, align your container sizes with your pre-initialized capacity worker sizes. For example, make sure that your Spark driver and executor sizes match your pre-initialized capacity worker sizes.
- To evaluate your application's performance and identify potential bottlenecks, review each stage and its duration in the Spark UI or Hive Tez UI. For more information, see Job worker-level monitoring and Spark troubleshooting and performance tuning.
- Follow Spark best practices and Hive best practices. For Amazon EMR releases 7.1.0 and later, use shuffle-optimized disks when you run Apache Spark or Hive jobs to improve performance for I/O intensive workloads.
- To troubleshoot job failures, choose how Amazon EMR Serverless stores and manages your application logs.
- Modify or turn off the auto-stop configuration. By default, an application automatically stops after it's idle for 15 minutes.
- To prevent bottlenecks when your job immediately needs high concurrency, don't set spark.executor.instances to 1.
- To improve job performance when spark.dynamicAllocation.enabled is true, increase the value of spark.dynamicAllocation.minExecutors. When spark.dynamicAllocation.enabled is false, increase the value of spark.executor.instances instead.
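The pre-initialized capacity guidance above can be sketched with the AWS CLI. The following is a minimal example that creates a Spark application with separately sized drivers and executors; the application name, worker counts, and worker sizes are illustrative assumptions that you must tune for your workload:

```shell
# Create a Spark application with pre-initialized (always-warm) capacity.
# "DRIVER" and "EXECUTOR" are the Spark worker types; worker counts and
# sizes below are example values only.
aws emr-serverless create-application \
  --type "SPARK" \
  --release-label "emr-7.1.0" \
  --name "my-warm-app" \
  --initial-capacity '{
    "DRIVER": {
      "workerCount": 2,
      "workerConfiguration": {"cpu": "2vCPU", "memory": "4GB"}
    },
    "EXECUTOR": {
      "workerCount": 10,
      "workerConfiguration": {"cpu": "4vCPU", "memory": "8GB"}
    }
  }'
```

For a Hive application, the worker types are HiveDriver and TezTask instead of DRIVER and EXECUTOR.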
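To let jobs scale beyond initialCapacity, you can raise the application's maximum capacity. This sketch assumes an existing application ID, and the CPU, memory, and disk limits are example values:

```shell
# Raise the application's maximum CPU, memory, and disk so that jobs can
# request additional workers beyond initialCapacity. Limits are examples.
aws emr-serverless update-application \
  --application-id "00example123" \
  --maximum-capacity '{"cpu": "400vCPU", "memory": "3000GB", "disk": "20000GB"}'
```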
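For the shuffle-optimized disks recommendation (Amazon EMR releases 7.1.0 and later), the disk type is requested in the worker configuration. The following sketch assumes example worker counts and sizes:

```shell
# Request shuffle-optimized disks for I/O-intensive Spark or Hive jobs
# (EMR release 7.1.0 and later). Sizes and counts are example values.
aws emr-serverless create-application \
  --type "SPARK" \
  --release-label "emr-7.1.0" \
  --name "my-shuffle-heavy-app" \
  --initial-capacity '{
    "EXECUTOR": {
      "workerCount": 10,
      "workerConfiguration": {
        "cpu": "4vCPU",
        "memory": "8GB",
        "disk": "500GB",
        "diskType": "SHUFFLE_OPTIMIZED"
      }
    }
  }'
```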
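To choose how Amazon EMR Serverless stores your application logs for troubleshooting, you can turn on managed log storage, send logs to your own Amazon S3 bucket, or both. In this sketch, the application ID, IAM role ARN, script path, and bucket name are placeholders:

```shell
# Start a job run that stores logs in managed storage and in your own
# S3 bucket. The role ARN, bucket, and script paths are placeholders.
aws emr-serverless start-job-run \
  --application-id "00example123" \
  --execution-role-arn "arn:aws:iam::111122223333:role/EMRServerlessJobRole" \
  --job-driver '{"sparkSubmit": {"entryPoint": "s3://amzn-s3-demo-bucket/scripts/job.py"}}' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "managedPersistenceMonitoringConfiguration": {"enabled": true},
      "s3MonitoringConfiguration": {"logUri": "s3://amzn-s3-demo-bucket/emr-serverless-logs/"}
    }
  }'
```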
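The auto-stop setting can be changed on an existing application. This sketch lengthens the idle timeout from the 15-minute default; the application ID and the 30-minute value are example assumptions:

```shell
# Lengthen the idle timeout before the application auto-stops
# (default is 15 minutes). Set "enabled": false to turn auto-stop off.
aws emr-serverless update-application \
  --application-id "00example123" \
  --auto-stop-configuration '{"enabled": true, "idleTimeoutMinutes": 30}'
```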
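The dynamic allocation settings above are passed through sparkSubmitParameters when you start a job run. In this sketch, the application ID, role ARN, script path, and executor count are illustrative assumptions for a job that needs high concurrency at startup:

```shell
# Raise the dynamic-allocation floor so the job starts with more
# executors. The minExecutors value is an example; tune it per workload.
aws emr-serverless start-job-run \
  --application-id "00example123" \
  --execution-role-arn "arn:aws:iam::111122223333:role/EMRServerlessJobRole" \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://amzn-s3-demo-bucket/scripts/job.py",
      "sparkSubmitParameters": "--conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.minExecutors=10"
    }
  }'
```

If spark.dynamicAllocation.enabled is false, set --conf spark.executor.instances to the fixed executor count instead.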