How do I use Spark in my Amazon EMR cluster to connect to an Amazon Redshift cluster?


I want to use Spark in my Amazon EMR cluster to connect to an Amazon Redshift cluster.

Resolution

Note: Before you begin, make sure that you configure your Amazon Redshift cluster. Then, launch an Amazon EMR 6.4.0 or later cluster that has Spark installed.
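
For example, the following minimal boto3 sketch launches an Amazon EMR 6.4.0 cluster with Spark installed. The cluster name, key pair name, instance types, and AWS Region are placeholder assumptions, not values from this article:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # placeholder Region

    response = emr.run_job_flow(
        Name="spark-redshift-demo",              # placeholder cluster name
        ReleaseLabel="emr-6.4.0",                # Amazon EMR 6.4.0 or later
        Applications=[{"Name": "Spark"}],        # install Spark on the cluster
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "Ec2KeyName": "example-key-pair",    # placeholder EC2 key pair for SSH
            "KeepJobFlowAliveWhenNoSteps": True, # keep the cluster running
        },
        JobFlowRole="EMR_EC2_DefaultRole",       # default EC2 instance profile
        ServiceRole="EMR_DefaultRole",           # default EMR service role
    )
    print(response["JobFlowId"])                 # the new cluster's ID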

To use Spark in your Amazon EMR cluster to connect to an Amazon Redshift cluster, complete the following steps:

Test the connectivity between the Amazon EMR cluster and Amazon Redshift cluster

  1. Confirm that the Amazon Redshift cluster's security group inbound rules allow the Amazon EMR primary, core, and task security groups on TCP port 5439. (A minimal boto3 sketch of this rule follows these steps.)
    Note: If your clusters are deployed in two different Amazon Virtual Private Clouds (Amazon VPCs), then configure Amazon VPC peering.

  2. To connect to the Amazon EMR primary node, use SSH. Then, run the following telnet command:

    telnet example-redshift-endpoint 5439

    Note: Replace example-redshift-endpoint with your Amazon Redshift cluster endpoint.

    Example output:

    telnet redshift-cluster-1.XXXXX.us-east-1.redshift.amazonaws.com 5439
    Trying 172.31.48.21...
    Connected to redshift-cluster-1.XXXXX.us-east-1.redshift.amazonaws.com.
    Escape character is '^]'.
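
The following minimal boto3 sketch adds the inbound rule from step 1, assuming placeholder security group IDs. Repeat the call for the Amazon EMR core and task security groups:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder Region

    # Allow the EMR primary security group to reach Redshift on TCP port 5439
    ec2.authorize_security_group_ingress(
        GroupId="sg-0redshiftexample",           # placeholder: Redshift security group
        IpPermissions=[
            {
                "IpProtocol": "tcp",
                "FromPort": 5439,
                "ToPort": 5439,
                "UserIdGroupPairs": [
                    {"GroupId": "sg-0emrprimaryexample"}  # placeholder: EMR primary group
                ],
            }
        ],
    )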

Use Spark in Amazon EMR 6.4.0 or later to connect to your Amazon Redshift cluster

Complete the following steps:

  1. To connect to the Amazon EMR primary node, use SSH.
  2. To read an Amazon Redshift table, use the JDBC connector. For more information, see Connector parameters. In the following examples, replace <redshift-endpoint>, <user>, <password>, <tempdir>, and <iam_role> with your own values, and replace table with the name of the table to read.
    PySpark example:
    [hadoop@ip-10-10-10-10 ~]$ pyspark
    
    >>> # The pyspark shell already provides a SparkSession named "spark"
    >>> url = "jdbc:redshift://<redshift-endpoint>:5439/dev?user=<user>&password=<password>"
    >>> df = spark.read \
        .format("io.github.spark_redshift_community.spark.redshift") \
        .option("url", url) \
        .option("query", "select * from table") \
        .option("tempdir", "<tempdir>") \
        .option("aws_iam_role", "<iam_role>") \
        .load()
    >>> df.show(2)
    
    Spark-shell example:
    [hadoop@ip-192-168-1-113 ~]$ spark-shell
    
    scala> // The spark-shell already provides a SparkSession named "spark"
    scala> val url = "jdbc:redshift://<redshift-endpoint>:5439/dev?user=<user>&password=<password>"
    scala> val df = spark.read.
         |   format("io.github.spark_redshift_community.spark.redshift").
         |   option("url", url).
         |   option("query", "select * from table").
         |   option("tempdir", "<tempdir>").
         |   option("aws_iam_role", "<iam_role>").
         |   load()
    scala> df.show(2)
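
The community connector can also write a DataFrame back to Amazon Redshift. The following PySpark sketch continues the PySpark session from the previous example; example_copy is a placeholder target table name:

    >>> (df.write
            .format("io.github.spark_redshift_community.spark.redshift")
            .option("url", url)                    # same JDBC URL as the read example
            .option("dbtable", "example_copy")     # placeholder target table
            .option("tempdir", "<tempdir>")        # S3 staging directory
            .option("aws_iam_role", "<iam_role>")  # IAM role that Redshift assumes
            .mode("error")                         # fail if the table already exists
            .save())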

Note: For more secure ways to manage the user credentials that are used in the JDBC connection, see Identity and access management in Amazon Redshift.
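
For example, one option is to load the credentials from AWS Secrets Manager instead of hardcoding them in the JDBC URL. In the following sketch, the secret name example/redshift and its JSON keys are assumptions:

    import json
    import boto3

    secrets = boto3.client("secretsmanager", region_name="us-east-1")  # placeholder Region

    # Placeholder secret that stores JSON with "username" and "password" keys
    secret = json.loads(
        secrets.get_secret_value(SecretId="example/redshift")["SecretString"]
    )

    url = (
        "jdbc:redshift://<redshift-endpoint>:5439/dev"
        f"?user={secret['username']}&password={secret['password']}"
    )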

Related information

Using Amazon Redshift integration for Apache Spark with Amazon EMR
