How do I use an AWS Glue database table in a different account to convert record formats in a Kinesis Data Firehose delivery stream?

I want to use an AWS Glue database table in a different AWS account to convert record formats within an Amazon Kinesis Data Firehose delivery stream.

Resolution

To implement a Kinesis Data Firehose delivery stream record format conversion with an AWS Glue database table in a different account, complete the following steps.

The steps use two example accounts, Account A and Account B:

  • Use Account A to create the Amazon Simple Storage Service (Amazon S3) bucket and AWS Glue Data Catalog, database, and table. Use the Amazon S3 bucket in Account A to store the AWS Glue sample data.
  • Use Account B to create the AWS Identity and Access Management (IAM) role and Firehose delivery stream. Also, use Account B to create an Amazon S3 bucket as the destination of the delivery stream.

Create or locate an Amazon S3 bucket

You must create or use two existing Amazon S3 buckets:

  • In Account A, create or use an existing Amazon S3 bucket to store the AWS Glue sample data.
  • In Account B, create or use an existing Amazon S3 bucket as the destination of the delivery stream.

Upload a sample JSON file to Account A

Upload a sample JSON file to the Amazon S3 bucket in Account A.
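
If you don't have a sample file, the following minimal sketch shows one way to generate it. The field names (ticker, sector, price, change) and the helper names are hypothetical and aren't part of this article. The sketch writes one JSON object per line, and the upload helper assumes that boto3 is installed and that Account A credentials are configured.

```python
import json

# Hypothetical sample records; the field names are illustrative only.
SAMPLE_RECORDS = [
    {"ticker": "AMZN", "sector": "TECHNOLOGY", "price": 101.25, "change": -0.5},
    {"ticker": "XYZ", "sector": "ENERGY", "price": 42.0, "change": 1.75},
]


def to_json_lines(records):
    """Serialize the records as one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def write_sample_file(path, records=SAMPLE_RECORDS):
    """Write the serialized records to a local file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(to_json_lines(records))


def upload_sample_file(path, bucket, key):
    """Upload the file to the Account A bucket (assumes boto3 and credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without boto3

    boto3.client("s3").upload_file(path, bucket, key)
```

For example, call write_sample_file("sample.json"), and then call upload_sample_file with your own bucket name and object key.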

Add and run a crawler in Account A

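In Account A, add an AWS Glue crawler that points to the Amazon S3 location of the sample JSON file. Then, run the crawler to create the AWS Glue database and table.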
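
As an alternative to the console, the crawler can be sketched with boto3. This is an illustration under assumptions, not the only way to do it: the helper names are hypothetical, and the role that you pass in is assumed to already allow AWS Glue to read the sample data in the Account A bucket.

```python
def build_crawler_request(name, role_arn, database_name, s3_path):
    """Return keyword arguments for the AWS Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database_name,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and start it (assumes boto3 and Account A credentials)."""
    import boto3  # imported lazily so the builder above runs without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

After the crawler finishes, the database and table that it creates are the ones that you reference in the later steps.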

Create the delivery stream in Account B

Important:

  • To automatically create an IAM role with the permissions to access the delivery stream, use the Kinesis console.
  • An Amazon S3 bucket is the only destination that you can use for delivery stream record format conversions.
  • When you create the delivery stream, don't configure data conversion. Do this in the following section.

Create a Data Firehose stream in Account B, and then choose Amazon S3 for your destination. Make sure that you choose the Amazon S3 bucket that's located in Account B.

Modify the new Data Firehose delivery stream IAM role in Account B

Use Account B to complete the following steps:

1.    Open the Kinesis Data Firehose console, and then choose the Configuration tab on the details page of the delivery stream.

2.    Under Service access, choose the IAM role link.

3.    Add the following statement to the permissions policy of your IAM role so that the role can use the cross-account AWS Glue catalog, database, and table. For instructions, see Modifying a role permissions policy (console).

   {  
       "Effect": "Allow",  
       "Action": [  
           "glue:GetDatabase",  
           "glue:GetTable",  
           "glue:GetTableVersion"  
       ],  
       "Resource": [  
           "arn:aws:glue:<region>:<accountA-id>:catalog",  
           "arn:aws:glue:<region>:<accountA-id>:database/<databaseName>",  
           "arn:aws:glue:<region>:<accountA-id>:table/<databaseName>/<tableName>"  
       ]  
   }
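
The statement above is a fragment of a policy. The following sketch shows one way to fill in the placeholders and wrap the statement in a complete policy document. The Region, account ID, database name, and table name in the example call are hypothetical values, and the function names are illustrative.

```python
import json

# Actions taken from the statement above.
GLUE_READ_ACTIONS = ["glue:GetDatabase", "glue:GetTable", "glue:GetTableVersion"]


def glue_resource_arns(region, account_a_id, database, table):
    """Return the catalog, database, and table ARNs in Account A."""
    prefix = f"arn:aws:glue:{region}:{account_a_id}"
    return [
        f"{prefix}:catalog",
        f"{prefix}:database/{database}",
        f"{prefix}:table/{database}/{table}",
    ]


def build_role_policy(region, account_a_id, database, table):
    """Wrap the statement in a complete IAM policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": GLUE_READ_ACTIONS,
                "Resource": glue_resource_arns(region, account_a_id, database, table),
            }
        ],
    }


# Hypothetical values for illustration only.
policy = build_role_policy("us-east-1", "111122223333", "sampledb", "sampletable")
print(json.dumps(policy, indent=4))
```

The account ID in every ARN is the Account A ID, because the catalog, database, and table are in Account A.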

Add or update the AWS Glue Data Catalog resource policy in Account A

Complete the following steps:

1.    Open the AWS Glue console.

2.    In the navigation pane, choose Data Catalog, and then choose Catalog Settings.

3.    For Permissions, enter the following example policy:

{  
    "Version": "2012-10-17",  
    "Statement": [{  
        "Effect": "Allow",  
        "Principal": {  
            "AWS": "<Firehose IAM ROLE ARN>"  
        },  
        "Action": [  
            "glue:GetDatabase",  
            "glue:GetTable",  
            "glue:GetTableVersion",  
            "glue:GetTableVersions"  
        ],  
        "Resource": [  
            "arn:aws:glue:<region>:<accountA-id>:catalog",  
            "arn:aws:glue:<region>:<accountA-id>:database/<databaseName>",  
            "arn:aws:glue:<region>:<accountA-id>:table/<databaseName>/<tableName>"  
        ]  
    }]  
}

4.    Choose Save.

Note: For more information, see Specifying AWS Glue resource ARNs.

The Data Firehose IAM role can now communicate with the AWS Glue database in a different account.
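
If you prefer to set the catalog policy with code, the following sketch builds the same example policy and passes it to the AWS Glue PutResourcePolicy API through boto3. The helper names are hypothetical. The sketch assumes Account A credentials, and it assumes that no other catalog policy statements need to be kept: the call sets the whole policy, so merge in any existing statements first.

```python
import json

# Actions taken from the example resource policy above.
CATALOG_ACTIONS = [
    "glue:GetDatabase",
    "glue:GetTable",
    "glue:GetTableVersion",
    "glue:GetTableVersions",
]


def build_catalog_policy(firehose_role_arn, region, account_a_id, database, table):
    """Build the Data Catalog resource policy for the Account B role."""
    prefix = f"arn:aws:glue:{region}:{account_a_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": firehose_role_arn},
                "Action": CATALOG_ACTIONS,
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/{database}",
                    f"{prefix}:table/{database}/{table}",
                ],
            }
        ],
    }


def put_catalog_policy(policy):
    """Apply the policy to the Account A Data Catalog (assumes boto3)."""
    import boto3  # imported lazily so the builder above runs without boto3

    boto3.client("glue").put_resource_policy(PolicyInJson=json.dumps(policy))
```

The principal is the IAM role in Account B, and the resources are the ARNs in Account A.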

Turn on cross-account record format conversion

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

Complete the following steps:

1.    Run the describe-delivery-stream AWS CLI command:

aws firehose describe-delivery-stream --delivery-stream-name <deliveryStreamName>

2.    Create a JSON file that's named createDeliveryStream.json to update the delivery stream's record format conversion attributes. Use the output information of the previous command and the attributes of your AWS Glue database. Also, use the following example as a reference for the required attributes:

{  
    "CurrentDeliveryStreamVersionId": "<versionID>",  
    "DestinationId": "<DestinationId>",  
    "ExtendedS3DestinationUpdate": {  
        "BufferingHints": {  
            "SizeInMBs": 128,  
            "IntervalInSeconds": 300  
        },  
        "DataFormatConversionConfiguration": {  
            "SchemaConfiguration": {  
                "RoleARN": "<deliveryStreamIAMroleARN>",  
                "DatabaseName": "<DatabaseName>",  
                "CatalogId": "<accountA-id>",  
                "TableName": "<TableName>",  
                "Region": "<region>",  
                "VersionId": "LATEST"  
            },  
            "InputFormatConfiguration": {  
                "Deserializer": {  
                    "OpenXJsonSerDe": {}  
                }  
            },  
            "OutputFormatConfiguration": {  
                "Serializer": {  
                    "ParquetSerDe": {}  
                }  
            },  
            "Enabled": true  
        }  
    }  
}

Note: For additional information on BufferingHints restrictions, see Converting input record format (API).

3.    Run the update-destination AWS CLI command:

aws firehose update-destination --delivery-stream-name <deliveryStreamName> --cli-input-json file://createDeliveryStream.json

Note: Replace deliveryStreamName with the delivery stream name from the output of step 1. The AWS CLI interprets the --cli-input-json parameter with the file:// prefix as the location of a file that's relative to your current directory.
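
The JSON file from step 2 can also be generated from the step 1 output. The following sketch assumes that you saved the describe-delivery-stream output as a JSON document and loaded it into a dictionary. The function names are hypothetical. The sketch reads the version ID and the first destination ID from the output, and it reuses the buffering and format values from the example above.

```python
import json


def build_schema_configuration(role_arn, account_a_id, database, table, region):
    """Point the conversion at the AWS Glue table in Account A."""
    return {
        "RoleARN": role_arn,
        "DatabaseName": database,
        "CatalogId": account_a_id,
        "TableName": table,
        "Region": region,
        "VersionId": "LATEST",
    }


def build_update_request(description, schema_configuration):
    """Build the update-destination input from describe-delivery-stream output."""
    stream = description["DeliveryStreamDescription"]
    return {
        "CurrentDeliveryStreamVersionId": stream["VersionId"],
        # Assumes the stream has a single (Amazon S3) destination.
        "DestinationId": stream["Destinations"][0]["DestinationId"],
        "ExtendedS3DestinationUpdate": {
            "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
            "DataFormatConversionConfiguration": {
                "SchemaConfiguration": schema_configuration,
                "InputFormatConfiguration": {
                    "Deserializer": {"OpenXJsonSerDe": {}}
                },
                "OutputFormatConfiguration": {
                    "Serializer": {"ParquetSerDe": {}}
                },
                "Enabled": True,
            },
        },
    }


def write_update_file(request, path="createDeliveryStream.json"):
    """Write the request so that --cli-input-json can read it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(request, handle, indent=4)
```

Because the version ID changes after each update, regenerate the file from fresh describe-delivery-stream output before you run update-destination again.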

Use the KDG to send the JSON data to the designated delivery stream

To set up the delivery stream test data, use the Kinesis Data Generator (KDG). For instructions, see Test your streaming data solution with the new Amazon Kinesis Data Generator.

Important: The KDG data must have the same structure as the sample JSON file that you uploaded for the crawler in Account A.

After you establish the test data, send the data. Check the destination S3 bucket in Account B to find the converted data. Then, use Amazon CloudWatch to monitor the metrics. For format conversion metrics information, see Format conversion CloudWatch metrics.

If the record format conversion is successful, then the SucceedConversion.Records metric shows a value that's greater than zero.
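
The same check can be sketched with boto3 and the CloudWatch GetMetricStatistics API. The helper names, the 60-minute window, and the 5-minute period are assumptions for illustration. The sketch assumes Account B credentials and the AWS/Firehose namespace with the DeliveryStreamName dimension.

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(stream_name, end_time, minutes=60):
    """Return keyword arguments for the CloudWatch GetMetricStatistics API."""
    return {
        "Namespace": "AWS/Firehose",
        "MetricName": "SucceedConversion.Records",
        "Dimensions": [{"Name": "DeliveryStreamName", "Value": stream_name}],
        "StartTime": end_time - timedelta(minutes=minutes),
        "EndTime": end_time,
        "Period": 300,
        "Statistics": ["Sum"],
    }


def total_converted_records(datapoints):
    """Add up the Sum value of every returned datapoint."""
    return sum(point["Sum"] for point in datapoints)


def fetch_converted_records(stream_name):
    """Query CloudWatch for the last hour (assumes boto3 and credentials)."""
    import boto3  # imported lazily so the helpers above run without boto3

    query = build_metric_query(stream_name, datetime.now(timezone.utc))
    response = boto3.client("cloudwatch").get_metric_statistics(**query)
    return total_converted_records(response["Datapoints"])
```

A result of zero shortly after you send data can mean that the buffering interval hasn't elapsed yet, so allow time before you conclude that the conversion failed.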

Troubleshooting

If you see null column values from the converted output files in the Amazon S3 bucket, then there's a mismatch in the mapping fields. To resolve this issue, confirm that the payload field structure aligns with the AWS Glue table fields. You can use the AWS Glue crawler to automatically extract and define the field mapping.
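
To compare a payload with the table, a small check such as the following sketch can help. The function name is hypothetical. The sketch compares only top-level field names, and it lowercases both sides on the assumption that the comparison should be case-insensitive. You supply the column names yourself, for example from the table schema in the AWS Glue console.

```python
def find_field_mismatches(record, table_columns):
    """Compare top-level record fields with table column names."""
    record_fields = {key.lower() for key in record}
    columns = {name.lower() for name in table_columns}
    return {
        # Columns with no matching field are likely to be null in the output.
        "missing_from_record": sorted(columns - record_fields),
        # Fields with no matching column have no place in the table schema.
        "not_in_table": sorted(record_fields - columns),
    }


# Hypothetical record and column names for illustration only.
result = find_field_mismatches(
    {"Ticker": "AMZN", "cost": 101.25},
    ["ticker", "price"],
)
print(result)
```

Empty lists in both entries suggest that the top-level field names line up. Nested structures and data types need a separate check.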

You might see the following error when you run the CreateDeliveryStream or UpdateDestination operation:

"Access was denied when calling Glue. Please ensure that the role specified in the data format conversion configuration has the necessary permissions."

This error can occur in the following scenarios:

  • The permissions or resources within the Data Firehose IAM role are incorrect. To resolve this issue, specify the correct permissions and resources.
  • The AWS Glue Data Catalog permissions are incorrect. To resolve this issue, specify the correct Kinesis Data Firehose IAM role ARN. Also, specify the correct resources in the AWS Glue Data Catalog permissions.
  • The configurations in the DataFormatConversionConfiguration are incorrect. To resolve this issue, specify the correct database name, catalog ID, and table name.
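
To narrow down which scenario applies, you can compare the resources in a policy with the ARNs that the SchemaConfiguration implies. The following sketch is a rough aid with hypothetical function names. It matches ARNs as exact strings, so it doesn't understand wildcards, conditions, or Deny statements.

```python
def expected_glue_arns(schema_configuration):
    """Derive the catalog, database, and table ARNs from SchemaConfiguration."""
    region = schema_configuration["Region"]
    catalog_id = schema_configuration["CatalogId"]
    database = schema_configuration["DatabaseName"]
    table = schema_configuration["TableName"]
    prefix = f"arn:aws:glue:{region}:{catalog_id}"
    return [
        f"{prefix}:catalog",
        f"{prefix}:database/{database}",
        f"{prefix}:table/{database}/{table}",
    ]


def missing_resources(policy, schema_configuration):
    """List expected ARNs that no Allow statement in the policy names."""
    granted = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        resources = statement.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]  # a single ARN can be given as a string
        granted.update(resources)
    return [arn for arn in expected_glue_arns(schema_configuration) if arn not in granted]
```

Run the check against both the IAM role policy in Account B and the Data Catalog resource policy in Account A. An empty list from both means that the ARNs match the configured database name, catalog ID, and table name.
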
AWS OFFICIAL - Updated 6 months ago