How do I set up an SSL connection between Hive on Amazon EMR and a metastore on Amazon RDS for MySQL?

9 minuto de leitura
0

I want to set up an SSL connection between Hive on Amazon EMR and a metastore on Amazon Relational Database Service (Amazon RDS) for MySQL.

Short description

To set up an encrypted connection between Hive and an external metastore, use an SSL certificate. You can set up an SSL connection when you launch a new Amazon EMR cluster or after the cluster is running.

Note: The following steps were tested on Amazon EMR version 7.3.0 and Amazon RDS for MySQL version 8.0.39.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Set up an SSL connection on a new Amazon EMR cluster

Complete the following steps:

  1. To create an Amazon RDS for MySQL DB instance, run the following create-db-instance command:

    aws rds create-db-instance \
     --db-name hive \
     --db-instance-identifier example-db-instance-identifier \
     --db-instance-class db.c6gd.large \
     --engine mysql --engine-version 8.0.39 \
     --db-subnet-group-name example-subnet-group \
     --master-username example-rds-primary-user \
     --master-user-password example-rds-primary-password \
     --allocated-storage 200 \
     --storage-type gp3 \
     --vpc-security-group-ids example-rds-vpc-security-group

    Note: Replace example-db-instance-identifier with your DB instance identifier, example-subnet-group with your subnet group name, example-rds-primary-user with your Amazon RDS primary username, and example-rds-primary-password with your Amazon RDS primary password. Also, replace example-rds-vpc-security-group with your Amazon RDS Amazon Virtual Private Cloud (Amazon VPC) security group name.

  2. Connect to the Amazon RDS for MySQL DB instance as the primary user. Then, create a user for the Hive metastore:
    Note: Make sure that you restrict the primary user's access to the DB instance that you previously created.

    mysql -h example-rds-db-instance-endpoint -P 3306 -u example-rds-primary-user -p
    Enter password: example-rds-primary-password
    CREATE USER 'example-hive-username' IDENTIFIED BY 'example-hive-password' REQUIRE SSL;
    REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'example-hive-username';
    GRANT ALL PRIVILEGES ON hive.* TO 'example-hive-username';
    FLUSH PRIVILEGES;

    Note: Replace example-rds-db-instance-endpoint with your Amazon RDS DB instance endpoint, example-rds-primary-user with your Amazon RDS primary username and example-rds-primary-password with your Amazon RDS primary password. Also, replace example-hive-username with your Hive username and example-hive-password with your password for example-hive-username.

  3. Create a JSON configuration file similar to the following:

    Note: Use the following JSON file to launch the Amazon EMR cluster in the next step. The file allows Hive to establish an SSL connection to the Amazon RDS DB instance.

    [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL": "jdbc:mysql://example-rds-db-instance-endpoint:3306/hive?createDatabaseIfNotExist=true&useSSL=true&serverSslCert=/home/hadoop/global-bundle.pem",
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": "example-hive-username",
                "javax.jdo.option.ConnectionPassword": "example-hive-password"
            }
        }
    ]

    Note: Replace example-rds-db-instance-endpoint with your Amazon RDS DB instance endpoint, example-hive-username with your username that Hive will use to connect to Amazon RDS DB instance, and example-hive-password with your password for example-hive-username.

  4. Create an inbound rule in the security group that's associated with your Amazon RDS for MySQL instance as follows:
    For Type, choose MYSQL/Aurora (3306).
    For Protocol, TCP (6) is selected by default.
    For Port Range, 3306 is selected by default.
    For Source, enter the Group ID of the Amazon EMR managed security group that's associated with the primary node.
    Note: This rule allows the Amazon EMR cluster's primary node to access the Amazon RDS instance. For more information, see Overview of VPC security groups.

  5. Create the following bootstrap action script file. Then, upload it to an Amazon S3 bucket. The bootstrap action downloads the SSL certificate to /home/hadoop/ on the primary node.

    #!/bin/bash
    if grep isMaster /mnt/var/lib/info/instance.json | grep false;
    then        
        echo "This is not primary node, do nothing, exiting"
        exit 0
    fi
    echo "This is primary, continuing to execute script"
    cd /home/hadoop
    wget -S -T 10 -t 5 https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem
  6. To launch an Amazon EMR cluster, run the following create-cluster command with the previous JSON file and bootstrap action:

    Example:

    aws emr create-cluster \
     --name "EMR Hive Metastore SSL" \
     --log-uri $LOG_URI \
     --release-label "emr-7.3.0" \
     --service-role $EMRServiceRole \
     --ec2-attributes KeyName=$EC2_KEY_PAIR,InstanceProfile=$EMREC2Role,SubnetId=$EMR_SUBNET,EmrManagedSlaveSecurityGroup=$EMR_CORE_AND_TASK_VPC_SG,EmrManagedMasterSecurityGroup=$EMR_PRIMARY_VPC_SG \
     --applications Name=Hadoop Name=Hive \
     --bootstrap-actions Path=$BOOTSTRAP_ACTION_SCRIPT_PATH \
     --configurations file:///<Full-Path-To>/hive-ext-meta-mysql-ssl.json \
     --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"CORE","Name":"Core","InstanceType":"m5.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]}},{"InstanceCount":1,"InstanceGroupType":"MASTER","Name":"Primary","InstanceType":"m5.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]}},{"InstanceCount":1,"InstanceGroupType":"TASK","Name":"Task - 1","InstanceType":"m5.xlarge","EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"VolumeType":"gp2","SizeInGB":32},"VolumesPerInstance":2}]}}]'
  7. Use SSH to connect to the primary node.

  8. Open a Hive session on the primary node. Then, create a table for test purposes.

    Example:

    hive> create table tb_test (col1 STRING, col2 BIGINT);
    OK
    Time taken: 2.371 seconds
    hive> describe tb_test;
    OK
    col1                    string
    col2                    bigint
    Time taken: 0.254 seconds, Fetched: 2 row(s)
  9. To connect to the Amazon RDS for MySQL metastore, use the mysql client on the primary node. Then, check the table metadata in the metastore. If the metadata corresponds to the table that you created in the previous step, then the SSL connection works.

    Example:

    mysql -h example-rds-db-instance-endpoint -P 3306 -u example-rds-primary-user -pEnter password: example-rds-primary-password
    
    mysql> use hive;
    Database changed
    
    mysql> select t1.OWNER, t1.TBL_NAME, t1.TBL_TYPE, s1.INPUT_FORMAT, s1.OUTPUT_FORMAT, s1.LOCATION from TBLS t1 inner join SDS s1 on s1.SD_ID = t1.SD_ID where t1.TBL_NAME = 'tb_test'\G
    *************************** 1. row ***************************
            OWNER: hadoop
         TBL_NAME: tb_test
         TBL_TYPE: MANAGED_TABLE
     INPUT_FORMAT: org.apache.hadoop.mapred.TextInputFormat
    OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
         LOCATION: hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hive/warehouse/tb_test
    1 row in set (0.23 sec)
    
    mysql> select t1.OWNER, t1.TBL_NAME, c1.COLUMN_NAME, c1.TYPE_NAME from TBLS t1 inner join SDS s1 on s1.SD_ID = t1.SD_ID inner join COLUMNS_V2 c1 on c1.CD_ID = s1.CD_ID where t1.TBL_NAME = 'tb_test';
    +--------+----------+-------------+-----------+
    | OWNER  | TBL_NAME | COLUMN_NAME | TYPE_NAME |
    +--------+----------+-------------+-----------+
    | hadoop | tb_test  | col1        | string    |
    | hadoop | tb_test  | col2        | bigint    |
    +--------+----------+-------------+-----------+
    2 rows in set (0.22 sec)

    Note: Replace example-rds-db-instance-endpoint with your Amazon RDS DB instance endpoint, example-rds-primary-user with your Amazon RDS primary username and example-rds-primary-password with your Amazon RDS primary password.

Set up an SSL connection on a running Amazon EMR cluster

Note: Before you begin, make sure that you have an Amazon RDS for MySQL DB instance.

  1. To connect to the primary node, use SSH.

  2. Download the SSL certificate to /home/hadoop/ on the primary node:

    cd /home/hadoop && wget -S -T 10 -t 5 https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem
  3. In the /etc/hive/conf.dist directory, add or edit the following lines in the hive-site.xml file:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://example-rds-db-instance-endpoint:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=true&amp;serverSslCert=/home/hadoop/global-bundle.pem</value>
      <description>example-rds-db-instance-endpoint</description>
    </property>
    
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>example-hive-username</value>
        <description>example-metastore-db-user</description>
    </property>
    
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>example-hive-password</value>
        <description>example-metastore-db-password</description>
    </property>

    Note: Replace example-rds-db-instance-endpoint with your Amazon RDS DB instance endpoint, example-hive-username with your username that Hive will use to connect to Amazon RDS DB instance, and example-hive-password with your password for example-hive-username. Also, the ampersand (&) is a special character in XML. To use an ampersand in hive-site.xml such as in the JDBC string, use & instead of &. If you don't use &, then you will receive an error when you restart hive-hcatalog-server.

  4. Test the SSL connection:

    mysql -h example-rds-db-instance-endpoint -P 3306 -u example-hive-username -p --ssl-ca /home/hadoop/global-bundle.pem
    Enter password: example-hive-password

    Note: Replace example-rds-db-instance-endpoint with your Amazon RDS DB instance endpoint, example-hive-username with your username that Hive will use to connect to Amazon RDS DB instance, and example-hive-password with your password for example-hive-username.

  5. Restart hive-hcatalog-server on the primary node.

  6. Confirm that the services restarted successfully:

    sudo systemctl status hive-hcatalog-server.service
  7. Open a Hive session on the primary node. Then, create a table for test purposes.

    Example:

    hive> create table tb_test (col1 STRING, col2 BIGINT);OK
    Time taken: 2.371 seconds
    
    hive> describe tb_test;
    OK
    col1                    string
    col2                    bigint
    Time taken: 0.254 seconds, Fetched: 2 row(s)
  8. To connect to the Amazon RDS for MySQL metastore, use the mysql client on the primary node. Then, check the table metadata in the metastore. If the metadata corresponds to the table that you created in the previous step, then the SSL connection works.

    Example:

    $ mysql -h example-rds-db-instance-endpoint -P 3306 -u example-rds-primary-user -p
    Enter password: example-rds-primary-password
    
    mysql> use hive;
    Database changed
    
    mysql> select t1.OWNER, t1.TBL_NAME, t1.TBL_TYPE, s1.INPUT_FORMAT, s1.OUTPUT_FORMAT, s1.LOCATION from TBLS t1 inner join SDS s1 on s1.SD_ID = t1.SD_ID where t1.TBL_NAME = 'tb_test'\G
    *************************** 1. row ***************************
            OWNER: hadoop
         TBL_NAME: tb_test
         TBL_TYPE: MANAGED_TABLE
     INPUT_FORMAT: org.apache.hadoop.mapred.TextInputFormat
    OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
         LOCATION: hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hive/warehouse/tb_test
    1 row in set (0.23 sec)
    
    mysql> select t1.OWNER, t1.TBL_NAME, c1.COLUMN_NAME, c1.TYPE_NAME from TBLS t1 inner join SDS s1 on s1.SD_ID = t1.SD_ID inner join COLUMNS_V2 c1 on c1.CD_ID = s1.CD_ID where t1.TBL_NAME = 'tb_test';
    +--------+----------+-------------+-----------+
    | OWNER  | TBL_NAME | COLUMN_NAME | TYPE_NAME |
    +--------+----------+-------------+-----------+
    | hadoop | tb_test  | col1        | string    |
    | hadoop | tb_test  | col2        | bigint    |
    +--------+----------+-------------+-----------+
    2 rows in set (0.22 sec)

    Note: Replace example-rds-db-instance-endpoint with your Amazon RDS DB instance endpoint, example-rds-primary-user with your Amazon RDS primary username and example-rds-primary-password with your Amazon RDS primary password.

Troubleshoot hive-hcatalog-server restart errors

When you try to restart hive-hcatalog-server, you might receive the following error or similar:

"2020-08-20T14:18:50,750 WARN [main] org.apache.hadoop.hive.metastore.HiveMetaStore - Retrying creating default database after error: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://mysql-hive-meta.########.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true&useSSL=true&serverSSlCert=/home/hadoop/global-bundle.pem, username = masteruser. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------java.sql.SQLException: Host '172.31.41.187' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'"

This error occurs when the Amazon RDS for MySQL DB instance blocks the Amazon EMR cluster's primary node as a security precaution. To resolve this error, complete the following steps:

  1. Connect to a different local machine or Amazon Elastic Compute Cloud (Amazon EC2) instance that has the mysqladmin tool installed.
  2. Flush the primary node from the DB instance:
    mysqladmin -h example-rds-db-instance-endpoint -P 3306 -u example-rds-primary-username -p flush-hosts
    Enter password: example-rds-primary-password
    Note: Replace example-rds-primary-username with your Amazon RDS primary username and example-rds-primary-password with your Amazon RDS primary password.
  3. Restart hive-hcatalog-server.

Related information

Using an external MySQL database or Amazon Aurora

AWS OFICIAL
AWS OFICIALAtualizada há 9 dias