Greenplum to Redshift Migration Tip

7 minute read
Content level: Advanced

Overcoming the TEXT Data Challenge

As the former "guru" of Greenplum, I find myself helping more and more Greenplum customers migrate to Amazon Redshift. Customers are finding that their journey to the cloud is simpler and less expensive with Amazon Redshift, which provides better price performance at any scale than other cloud data warehouses.

The first place to start the migration is with the AWS Schema Conversion Tool (SCT). It simplifies the conversion and migration of Greenplum to Redshift with a graphical user interface that makes the migration a point-and-click endeavor.

A three-part blog series (Part 1, Part 2, and Part 3) covers the migration of Greenplum to Redshift in great detail, so please check it out. This post focuses on a workaround to a challenge discussed in the third post: handling TEXT and unbounded VARCHAR columns.

I have found that it is common for a Greenplum environment to use TEXT, or a VARCHAR without a specified length, for a column. Currently, SCT handles these columns as LOBs, which puts the data in S3 rather than loading it directly into the table. This may change in the future, but for now it is a challenge.
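
To make the challenge concrete, here is a hypothetical Greenplum table (the schema, table, and column names are made up for illustration) with the kind of columns SCT currently converts as LOBs:

-- Hypothetical example: neither column below has a declared length,
-- so SCT treats both as LOBs during conversion.
CREATE TABLE demo.customer_notes
(
    note_id   integer,
    note_text text,     -- TEXT column
    note_tag  varchar   -- unbounded VARCHAR column
)
DISTRIBUTED BY (note_id);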

One solution is to alter the Greenplum database to specify the lengths of the VARCHAR columns. This is discussed in Part 3 of the blog series, which also includes a simple script that converts columns in Greenplum to VARCHAR(n); a small example of that approach is shown after the step list below. However, if you don't want to alter the Greenplum database, you can follow these steps instead.

  1. Use SCT to convert the Greenplum schemas to Redshift. This is covered in detail in the 3-part blog series referenced above.
  2. Configure an IAM User to allow access to load data into S3.
  3. Create an S3 Bucket.
  4. Configure Greenplum to enable External Tables with S3.
  5. Create External Tables in Greenplum to unload data.
  6. Unload the data from Greenplum to S3.
  7. Load the data from S3 to Redshift using COPY.
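
For reference, here is a small sketch of the alter-in-place approach mentioned above. The table, column, and length are assumptions, so size the VARCHAR from your actual data.

-- Hypothetical example: find the longest value currently stored ...
SELECT max(length(cc_name)) FROM tpcds.call_center;

-- ... and then convert the unbounded column to a bounded VARCHAR.
ALTER TABLE tpcds.call_center ALTER COLUMN cc_name TYPE varchar(50);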

IAM User

In the AWS console, navigate to IAM and, in the left pane, click Users. On the right, click "Add users". Create a "gpadmin" user and attach the "AmazonS3FullAccess" policy to it.

[Screenshot: Specify user details]

Click Next and then Create user.

Click the "gpadmin" user and then click Security credentials. Click Create access key.

[Screenshot: Security credentials]

Choose Third-party service, click Next, and then click Create access key. Click Show for the Secret access key, then copy and save both the Access key and the Secret access key. Click Done.

S3 Bucket

In the AWS console, navigate to S3 and click Create bucket. Pick a unique bucket name and an appropriate AWS Region. If your Greenplum cluster is already in AWS, use the same Region. Use the defaults for the rest and then click Create bucket.

Greenplum Setup

Connect to the Greenplum Master node with ssh as the gpadmin user and execute the gpcheckcloud command as follows:

gpcheckcloud -t > s3.conf

The next step is to edit the s3.conf file. Make the following configuration changes and leave the other settings as-is.

secret = "Secret access key"
accessid = "Access key"
encryption = false

Copy the s3.conf file to each Segment host. The example below uses a file named segment_hosts.txt, which contains a single line for each Segment host in the cluster.

gpscp -f segment_hosts.txt s3.conf =:/home/gpadmin/

Create the following database objects in Greenplum. Greenplum provides the library, but it requires you to create these two functions and one protocol object. Log in to the Greenplum database as gpadmin with psql and execute the following SQL commands:

CREATE OR REPLACE FUNCTION write_to_s3() RETURNS integer AS '$libdir/gps3ext.so', 's3_export' LANGUAGE C STABLE;
CREATE OR REPLACE FUNCTION read_from_s3() RETURNS integer AS '$libdir/gps3ext.so', 's3_import' LANGUAGE C STABLE;
CREATE PROTOCOL s3 (writefunc = write_to_s3, readfunc = read_from_s3);
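
Optionally, you can sanity-check that the protocol and its functions were registered. Assuming the Greenplum pg_extprotocol catalog is available, a quick query looks like this:

-- Quick check: the s3 protocol should appear with its read and write functions.
SELECT ptcname, ptcreadfn::regproc AS readfunc, ptcwritefn::regproc AS writefunc
FROM pg_extprotocol;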

Greenplum External Tables

In this example, I have a table called tpcds.call_center that I want to copy from Greenplum to Redshift. I already used SCT to create the new table in Redshift so the Redshift table already exists.

In Greenplum, create a schema for the external tables. Log into Greenplum as gpadmin with psql and execute the following:

CREATE SCHEMA ext_tpcds;

Next, create the external table:

CREATE WRITABLE EXTERNAL TABLE ext_tpcds.call_center 
(like tpcds.call_center) 
LOCATION ('s3://s3-us-east-2.amazonaws.com/demo-gpadmin/call_center config=/home/gpadmin/s3.conf') FORMAT 'csv';

Now, you can copy the data from Greenplum to S3.

INSERT INTO ext_tpcds.call_center SELECT * FROM tpcds.call_center;
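
If you want to spot-check what landed in S3 before moving on, one option is a readable external table over the same location. This is a sketch only; the _readback name is made up, and it relies on the read_from_s3 function created earlier.

-- Hypothetical read-back check: compare the source row count with the
-- rows read back from the unloaded files in S3.
CREATE EXTERNAL TABLE ext_tpcds.call_center_readback
(LIKE tpcds.call_center)
LOCATION ('s3://s3-us-east-2.amazonaws.com/demo-gpadmin/call_center config=/home/gpadmin/s3.conf')
FORMAT 'csv';

SELECT (SELECT count(*) FROM tpcds.call_center)              AS source_rows,
       (SELECT count(*) FROM ext_tpcds.call_center_readback) AS unloaded_rows;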

Extra: Automating External Tables

Here is a quick way to create these external tables for a given schema. You simply create a SQL file on the Greenplum Master node and then execute it.

  1. Create create_tables.sql with the following:
select sqltext 
from (
select 'drop schema if exists ext_' || :schema_name || ' cascade;' as sqltext, 1 as id
union all
select 'create schema ext_' || :schema_name || ';', 2
union all
select 'CREATE WRITABLE EXTERNAL TABLE ext_' || :schema_name || '.' || c.relname ||
' (like ' || :schema_name || '.' || c.relname ||
') location (''s3://s3-' || :region || '.amazonaws.com/' ||
:bucket || '/' || c.relname || '/ config=' || :config || ''') format ''csv'';', 3
from pg_class c
join pg_namespace n on c.relnamespace = n.oid
left outer join pg_partitions p on c.relname = p.partitiontablename and n.nspname = p.schemaname
where n.nspname = :schema_name
and c.relkind = 'r'
and p.tablename is null
) as sub
order by id, sqltext;
  2. Execute the script. In this example, I'm using the following variables. Note: be sure to use the quoting as shown below.
    • schema_name="'tpcds'"
    • region="'us-east-2'"
    • bucket="'demo-gpadmin'"
    • config="'/home/gpadmin/s3.conf'"
psql -t -A -f create_tables.sql -v schema_name="'tpcds'" \
-v region="'us-east-2'" \
-v bucket="'demo-gpadmin'" \
-v config="'/home/gpadmin/s3.conf'" | psql -e
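
Once the script runs, a quick way to verify the result is to list the external tables that were generated. The ext_tpcds schema name below is an assumption that matches schema_name="'tpcds'".

-- List the external tables created by the generated script.
SELECT c.relname
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE n.nspname = 'ext_tpcds'
ORDER BY c.relname;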

Extra: Automating Data Unloading

Here is a quick way to unload data from Greenplum to S3 for a given schema. You simply create a SQL file on the Greenplum Master node and then execute it.

  1. Create unload_tables.sql with the following:
select 'insert into ext_' || :schema_name || '.' || c.relname ||                               
' select * from ' || :schema_name || '.' || c.relname || ';'
from pg_class c
join pg_namespace n on c.relnamespace = n.oid
left outer join pg_partitions p on c.relname = p.partitiontablename and n.nspname = p.schemaname
where n.nspname = :schema_name
and c.relkind = 'r'
and p.tablename is null
order by c.relname;
  2. Execute the script. In this example, I'm using the following variable. Note: be sure to use the quoting as shown below.
    • schema_name="'tpcds'"
psql -t -A -f unload_tables.sql -v schema_name="'tpcds'" | psql -e
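
If you also want per-table row counts to compare against Redshift after loading, the same generator pattern works. This is just a sketch and is not part of the unload itself:

-- Hypothetical helper: generate one count(*) statement per source table.
select 'select ''' || c.relname || ''' as table_name, count(*) from '
    || :schema_name || '.' || c.relname || ';'
from pg_class c
join pg_namespace n on c.relnamespace = n.oid
left outer join pg_partitions p on c.relname = p.partitiontablename and n.nspname = p.schemaname
where n.nspname = :schema_name
and c.relkind = 'r'
and p.tablename is null
order by c.relname;

Save it to a file and run it with the same psql -t -A -v schema_name="'tpcds'" invocation used above, piping the output to psql -e to print the counts.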

Redshift Loading

Now that the data is in S3, you can COPY it into Redshift. In my example, the bucket is in the us-east-2 Region, so be sure to change this to your Region. Log into Redshift and execute the following SQL:

COPY tpcds.call_center FROM 's3://demo-gpadmin/call_center/' IAM_ROLE DEFAULT REGION 'us-east-2' FORMAT CSV GZIP;

In this example, the default IAM Role is being used for simplification. You can find out more on this here: https://docs.aws.amazon.com/redshift/latest/mgmt/default-iam-role.html
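
If you unloaded an entire schema, you can mirror the generator approach on the Redshift side as well. This is a sketch that assumes the same schema, bucket, and Region used throughout this post; it simply prints one COPY statement per table for you to review and run.

-- Hypothetical generator: one COPY per table in the tpcds schema, using the
-- same bucket/prefix layout written by the writable external tables.
SELECT 'COPY tpcds.' || table_name
    || ' FROM ''s3://demo-gpadmin/' || table_name || '/'''
    || ' IAM_ROLE DEFAULT REGION ''us-east-2'' FORMAT CSV GZIP;'
FROM information_schema.tables
WHERE table_schema = 'tpcds'
  AND table_type = 'BASE TABLE'
ORDER BY table_name;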

Summary

That's it! By using Greenplum's external tables, you can easily overcome the TEXT and unbounded VARCHAR column challenge to migrate your Greenplum data to Amazon Redshift.