I want to set up a Spark SQL Java Database Connectivity (JDBC) connection on Amazon EMR.
Resolution
Note: The following steps use SQuirreL SQL Client for SQL client on a local machine.
To set up a Spark SQL JDBC connection on Amazon EMR, complete the following steps:
-
Download and install SQuirreL SQL Client. For more information, see Download and installation on the SQuirreL SQL website.
-
To connect to the Amazon EMR primary node, use SSH.
-
To start Spark Thrift Server, run the following command on the primary node:
sudo /usr/lib/spark/sbin/start-thriftserver.sh
-
On the primary node, copy all .jar files from the /usr/lib/spark/jars directory to your local machine.
-
Open SQuirreL SQL Client, and then create a new driver as follows:
For Name, enter Spark JDBC Driver.
For Example URL, enter jdbc:hive2://localhost:10001.
-
On the JDBC Driver Class Path tab, choose Add.
-
In the dialog box, navigate to the directory that you previously copied the .jar files to, and then select all the files.
-
In the Class Name field, enter org.apache.hive.jdbc.HiveDriver, and then choose OK.
-
To set up an SSH tunnel, use local port forwarding on your local machine:
ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10001:localhost:10001 hadoop@example-primary-public-dns
Note: Replace example-primary-public-dns with your primary public DNS name.
-
To connect to the Spark Thrift Server, create a new alias in SQuirreL SQL Client as follows:
For Name, enter Spark JDBC.
For Driver, enter Spark JDBC Driver.
For URL, enter jdbc:hive2://localhost:10001.
For Username, enter hadoop.
Note: You can now run queries from SQuirreL SQL Client.