
How do I set Spark parameters in Amazon EMR?


I want to configure Apache Spark parameters in Amazon EMR.

Short description

To configure Spark applications, pass command line arguments, such as --conf, to the spark-submit command. Or, set the values in the spark-defaults.conf file to make the changes permanent.

Resolution

Use spark-submit to configure Spark parameters

To load configurations dynamically through the Spark shell and spark-submit command, use one of the following options:

  • Command line options, such as --num-executors.
  • The --conf flag.

Note: To see the complete list of options, run spark-submit --help.

The spark-submit command reads the configuration options from spark-defaults.conf.

In the spark-defaults.conf file, each line includes a key and a value separated by white space.
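
For example, entries in spark-defaults.conf look like the following. The values shown here are only illustrative, not recommendations:

spark.master                     yarn
spark.serializer                 org.apache.spark.serializer.KryoSerializer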

For more information, see Submitting user applications with spark-submit. For more information on the parameters supported by Spark, see Spark configuration on the Apache Spark website.

Example configuration options:

--class <main class> \
--master <master URL> \
--deploy-mode <deploy mode> \
--conf <key>=<value> \
--num-executors <number of executors> \
--executor-memory <memory per executor>G \
--driver-memory <memory for driver>G \
--executor-cores <number of cores per executor> \
--driver-cores <number of driver cores> \
--jars <comma-separated list of JARs to include on the driver and executor classpaths> \
--packages <comma-separated list of Maven coordinates> \
--py-files <comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps> \

The spark-submit command automatically transfers the application JAR and any JARs that you include with the --jars option to the cluster. Separate the URLs that you supply after --jars with commas. spark-submit includes the list on the driver and executor class paths and copies the JARs and files to the working directory of each SparkContext on the executor nodes.

Note: Directory expansion doesn't work with --jars.
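
For example, the following sketch passes two dependency JARs as a comma-separated list. The class name, bucket name, and JAR names are placeholders:

spark-submit \
    --deploy-mode cluster \
    --class com.example.MyApp \
    --jars s3://amzn-s3-demo-bucket/libs/dep-one.jar,s3://amzn-s3-demo-bucket/libs/dep-two.jar \
    s3://amzn-s3-demo-bucket/apps/my-app.jar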

Example spark-submit command:

spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.dynamicAllocation.enabled=false \
    --master yarn \
    --num-executors 4 \
    --driver-memory 4G \
    --executor-memory 4G \
    --executor-cores 1 \
    /usr/lib/spark/examples/jars/spark-examples.jar \
    10

To pass the memory parameters, use the --conf flag:

spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.dynamicAllocation.enabled=false \
    --master yarn \
    --conf spark.driver.memory=1G \
    --conf spark.executor.memory=1G \
    /usr/lib/spark/examples/jars/spark-examples.jar \
    10
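
You can also submit the same job as a step on a running cluster (see Add a Spark step in the Related information section) and pass the spark-submit arguments through the step. The following is a minimal sketch with a placeholder cluster ID:

aws emr add-steps \
    --cluster-id j-XXXXXXXXXXXXX \
    --steps 'Type=Spark,Name=SparkPi,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--driver-memory,1G,--executor-memory,1G,--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,10]'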

Use custom Spark parameters to launch spark-shell and pyspark shell

To launch the spark-shell or pyspark shell with custom Spark parameters, run the following commands:

spark-shell

spark-shell \
    --conf spark.driver.maxResultSize=1G \
    --conf spark.driver.memory=1G \
    --deploy-mode client \
    --conf spark.executor.memory=1G \
    --conf spark.executor.heartbeatInterval=10000000s \
    --conf spark.network.timeout=10000001s \
    --executor-cores 1 \
    --num-executors 5 \
    --packages org.apache.spark:spark-avro_2.12:3.1.2 \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

pyspark shell

pyspark \
    --conf spark.driver.maxResultSize=1G \
    --conf spark.driver.memory=1G \
    --deploy-mode client \
    --conf spark.executor.memory=1G \
    --conf spark.executor.heartbeatInterval=10000000s \
    --conf spark.network.timeout=10000001s \
    --executor-cores 1 \
    --num-executors 5 \
    --packages org.apache.spark:spark-avro_2.12:3.1.2 \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
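To confirm that the parameters took effect, you can inspect the runtime configuration from inside the pyspark shell. A minimal check:

# Print a single property that was set on the command line
print(spark.conf.get("spark.executor.memory"))

# List every property that was explicitly set
for key, value in spark.sparkContext.getConf().getAll():
    print(key, value)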

Use spark-defaults.conf to configure Spark parameters

To make the configuration changes permanent, append the configuration to the file /etc/spark/conf/spark-defaults.conf. Then, restart the Spark History Server. The following example configures the executor memory and driver memory in spark-defaults.conf. In this example, each line consists of a key and a value separated by white space.

Example

spark.executor.memory 9486M
spark.driver.memory 9486M
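
For example, on the cluster's primary node you might append the settings and then restart the Spark History Server. This is a minimal sketch; the systemd service name shown assumes a recent Amazon EMR release:

echo "spark.executor.memory 9486M" | sudo tee -a /etc/spark/conf/spark-defaults.conf
echo "spark.driver.memory 9486M" | sudo tee -a /etc/spark/conf/spark-defaults.conf
sudo systemctl restart spark-history-server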

The following configuration classification sets the Spark driver and executor memory at cluster launch:

[
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.memory": "9486M",
            "spark.driver.memory": "9486M"
        }
    }
]
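
To apply this classification when you launch the cluster, supply it with the --configurations option of the create-cluster command. The following is a minimal sketch; the release label, instance type, instance count, and file name are placeholders:

aws emr create-cluster \
    --name "spark-cluster" \
    --release-label emr-6.9.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --configurations file://spark-config.json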

Note: On Amazon EMR, the spark.yarn.executor.memoryOverhead configuration has a default value of 18.75% of executor memory, whereas the standard Apache Spark default is 10%. After you configure your Spark job, monitor its performance and analyze resource utilization to gather insight and further tune your job parameters.
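
If you need to override the memory overhead for a single job, you can pass it with --conf in the same way as the other memory parameters. The value below is only an illustration; in recent Spark releases, spark.executor.memoryOverhead replaces the older spark.yarn.executor.memoryOverhead name:

spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.memory=4G \
    --conf spark.executor.memoryOverhead=1G \
    /usr/lib/spark/examples/jars/spark-examples.jar \
    10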

Related information

AWS open data analytics

Add a Spark step

Modify your cluster on the fly with Amazon EMR reconfiguration
