How can I use AWS DMS to migrate data to Amazon S3 in Parquet format?


I want to use AWS Database Migration Service (AWS DMS) to migrate data in Apache Parquet (.parquet) format to Amazon Simple Storage Service (Amazon S3).

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

AWS DMS replication engine versions 3.1.3 and later can migrate data to an S3 bucket in Apache Parquet format. The default Parquet version is Parquet 1.0.

1.    In the AWS DMS console, create a target Amazon S3 endpoint. Review the other extra connection attributes that you can use to store Parquet objects in an S3 target, and then add the following extra connection attribute:

dataFormat=parquet;

Or, run the create-endpoint command in the AWS CLI to create a target Amazon S3 endpoint:

aws dms create-endpoint --endpoint-identifier s3-target-parquet --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "<IAM role ARN for S3 endpoint>", "BucketName": "<S3 bucket name to migrate to>", "DataFormat": "parquet"}'
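The --s3-settings value must be valid JSON, so every string value needs double quotes. The following is a minimal sketch of one way to assemble that JSON from shell variables before you call the AWS CLI. The function name build_s3_settings, the role ARN, and the bucket name are hypothetical placeholders, and the sketch assumes that the values contain no double quotes or backslashes.

```shell
#!/bin/sh
# Build the JSON document for --s3-settings from a role ARN and a bucket name.
# Assumption: neither value contains double quotes or backslashes.
build_s3_settings() {
  printf '{"ServiceAccessRoleArn": "%s", "BucketName": "%s", "DataFormat": "parquet"}' \
    "$1" "$2"
}

# Hypothetical placeholder values. Replace them with your own resources.
ROLE_ARN="arn:aws:iam::123456789012:role/dms-s3-target-role"
BUCKET="example-dms-target-bucket"

S3_SETTINGS="$(build_s3_settings "$ROLE_ARN" "$BUCKET")"
echo "$S3_SETTINGS"
```

You can then pass the result to the command as --s3-settings "$S3_SETTINGS".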

2.    To specify the Parquet version of the output file, use the following extra connection attribute. This example sets the version to Parquet 2.0 instead of the default Parquet 1.0:

parquetVersion=PARQUET_2_0;
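Extra connection attributes are combined into a single string, with a semicolon after each name=value pair. As a rough sketch, the two attributes from steps 1 and 2 can be joined as follows. The helper name join_eca is hypothetical.

```shell
#!/bin/sh
# Join name=value pairs into one extra connection attributes string.
# Each pair is followed by a semicolon, as in the attributes shown above.
join_eca() {
  result=""
  for pair in "$@"; do
    result="${result}${pair};"
  done
  printf '%s' "$result"
}

ECA="$(join_eca "dataFormat=parquet" "parquetVersion=PARQUET_2_0")"
echo "$ECA"
```

If you create the endpoint from the AWS CLI, then you can pass the string with the --extra-connection-attributes option of the create-endpoint command, for example --extra-connection-attributes "$ECA".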

3.    Run the describe-endpoints command to check whether the S3 setting DataFormat or the extra connection attribute dataFormat is set to parquet in the S3 endpoint:

aws dms describe-endpoints --filters Name=endpoint-arn,Values=<S3 target endpoint ARN> --query "Endpoints[].S3Settings.DataFormat"
[
    "parquet"
]

4.    If the value of the DataFormat parameter is CSV, then recreate the endpoint.
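The check in steps 3 and 4 can be scripted. The following sketch only classifies a DataFormat value that you pass to it. The function name check_data_format is hypothetical, and the comparison ignores letter case as a precaution.

```shell
#!/bin/sh
# Classify the DataFormat value reported by describe-endpoints.
# Returns 0 when the endpoint writes Parquet, 1 otherwise.
check_data_format() {
  # Lowercase the value so that "Parquet" and "parquet" compare equal.
  fmt="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  if [ "$fmt" = "parquet" ]; then
    echo "ok: endpoint writes Parquet"
    return 0
  fi
  echo "recreate: endpoint reports DataFormat=${1:-none}"
  return 1
}

# Examples with literal values instead of a live AWS CLI call.
check_data_format "parquet"
check_data_format "csv" || true
```

To feed it a live value, one option is to add --output text to the describe-endpoints command and use "Endpoints[0].S3Settings.DataFormat" as the query, and then pass the command's output as the argument.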

5.    To parse the output file, install the parquet-cli command line tool, which provides the parq command:

pip install parquet-cli --user

6.    Inspect the file format:

parq LOAD00000001.parquet
  # Metadata
  <pyarrow._parquet.FileMetaData object at 0x10e948aa0>
  created_by: AWS
  num_columns: 2
  num_rows: 2
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 169

7.    Print the file content:

parq LOAD00000001.parquet --head
   i        c
0  1  insert1
1  2  insert2
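As a quick sanity check that doesn't need parq, you can confirm that a downloaded file carries the Parquet magic bytes. Unencrypted Parquet files begin and end with the four ASCII bytes PAR1. The following is a minimal sketch. The function name has_parquet_magic is hypothetical, and the sketch assumes that your head command supports the -c option.

```shell
#!/bin/sh
# Return 0 if the file starts and ends with the 4-byte Parquet magic "PAR1".
# Returns 2 if the file does not exist, 1 if the magic bytes are missing.
has_parquet_magic() {
  [ -f "$1" ] || return 2
  head_bytes="$(head -c 4 "$1")"
  tail_bytes="$(tail -c 4 "$1")"
  [ "$head_bytes" = "PAR1" ] && [ "$tail_bytes" = "PAR1" ]
}

# Check the file name used in the steps above, or a path given as an argument.
FILE="${1:-LOAD00000001.parquet}"
if has_parquet_magic "$FILE"; then
  echo "$FILE looks like a Parquet file"
else
  echo "$FILE is missing or is not a Parquet file"
fi
```

A CSV file that AWS DMS wrote by mistake fails this check, which points back to the DataFormat setting in step 3.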

Related information

Using Amazon S3 as a target for AWS Database Migration Service

AWS OFFICIAL
Updated a year ago