Best Practices for Loading Electronic Health Claims Data to Amazon Redshift

4 minute read
Content level: Advanced

This article demonstrates techniques to load very large datasets as quickly as possible and in a cost-effective manner.

Amazon Redshift is a fast, petabyte-scale data warehouse that enables you to quickly analyze very large datasets to gain insights into your business. Many AWS Healthcare and Life Sciences (HCLS) customers leverage Electronic Health Information (EHI) for analytic use cases such as the patient journey. This data can be 100 TB or more and is received on a periodic basis, such as monthly or quarterly. It needs to be loaded as quickly as possible, which requires more resources. With Amazon Redshift Serverless you can scale to accommodate these periodic workloads without having to set up and manage data warehouse clusters.

The diagram below illustrates the high-level process.

High Level Process Flow

In this flow we focus on health claims data, which can be up to 40 TB with up to 350 billion rows. We use Amazon Redshift Serverless with the maximum RPU setting of 512 to load the data, and then create a data share that a smaller Redshift Serverless data warehouse can consume. With the maximum RPU setting you have the resources available to load this data in hours instead of days, and once the load completes you stop incurring costs on the larger Redshift Serverless data warehouse. Data sharing isolates the ETL from the analytic queries, so no users are impacted while the loads are running, and costs are managed by running analytics on the smaller Redshift Serverless data warehouse.
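The data sharing step could look like the following sketch. The share name, database name, and namespace IDs are placeholders, not from the original article:

```sql
-- On the large producer data warehouse: share the loaded table.
CREATE DATASHARE claims_share;
ALTER DATASHARE claims_share ADD SCHEMA public;
ALTER DATASHARE claims_share ADD TABLE public.health_claims_sorted;
GRANT USAGE ON DATASHARE claims_share TO NAMESPACE '<consumer-namespace-id>';

-- On the smaller consumer data warehouse: create a database from the share.
CREATE DATABASE claims_db FROM DATASHARE claims_share OF NAMESPACE '<producer-namespace-id>';
```

Because data sharing reads the producer's data in place, the consumer can query the table without any copy step.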

Health claims data

In addition to loading the data as quickly as possible, it needs to be consumable in a timely manner. Health claims data will have a unique identifier - in our example we use a column called 'upid'. This column should be defined as the distribution key, since it is very likely to be frequently joined with other tables on 'upid'. The 'upid' column will also typically be used as a filter in a 'where' condition, which makes it a good sort key to greatly reduce data scans. With these very large datasets it is advisable to apply the sort key in a post-load step to avoid disk spill that can slow down the copy job. Let's walk through the process.
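To see why 'upid' works well as both distribution key and sort key, consider a typical analytic query. The following is a sketch; the encounters table and the literal upid value are hypothetical, not from the original article:

```sql
-- The equi-join on upid is collocated because both tables distribute on it,
-- and the where filter lets Redshift skip sorted blocks that cannot match.
SELECT c.upid, c.claim_id, e.encounter_date
FROM public.health_claims c
JOIN public.encounters e ON c.upid = e.upid
WHERE c.upid = '0123456789abcdef0123456789abcdef';
```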

Create table without sort key for initial load

Below is a snippet of the health_claims table - note that it has a distribution key on upid and no sort key.

CREATE TABLE public.health_claims (
    record_id bigint ENCODE az64,
    claim_id character varying(128) ENCODE lzo,
    mc_enc_id character varying(128) ENCODE lzo,
    upid character varying(32) ENCODE lzo distkey,
    created date ENCODE az64,
    …
) DISTSTYLE KEY;

Load data from S3

You can use the following copy command, modified with your table name. You should use a manifest file that lists the locations of your Parquet files. For more information on using manifest files, see the Amazon Redshift COPY documentation.
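A manifest file for this load might look like the following sketch. The bucket and file names are placeholders; note that for columnar formats such as Parquet, each entry must include the file size in meta.content_length:

```json
{
  "entries": [
    {
      "url": "s3://<bucket>/health_claims/part-00000.parquet",
      "mandatory": true,
      "meta": { "content_length": 1073741824 }
    },
    {
      "url": "s3://<bucket>/health_claims/part-00001.parquet",
      "mandatory": true,
      "meta": { "content_length": 1073741824 }
    }
  ]
}
```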

copy public.health_claims
from 's3://999999999999-prototype/health_claimsmanifest'
iam_role default
format parquet
MANIFEST;

Now, you are ready to create a table with the sort key

Below is the code snippet for the sorted table. Note that the table is identical to the table we loaded via the copy command, except that it has a sort key of 'upid'.

CREATE TABLE public.health_claims_sorted (
    record_id bigint ENCODE az64,
    claim_id character varying(128) ENCODE lzo,
    mc_enc_id character varying(128) ENCODE lzo,
    upid character varying(32) ENCODE lzo distkey,
    created date ENCODE az64,
    …
) DISTSTYLE KEY SORTKEY (upid);

Load table with sort key

The unique identifiers associated with electronic health care data are typically defined as character columns containing 32 hexadecimal characters. Since a typical dataset will contain upid values across the full range of possible values, we can use a upid value of '8' to split the data into two roughly equal batches.
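Before running the inserts, you can sanity-check the split point with a count query like this sketch (the batch labels are illustrative):

```sql
-- Because upid values are spread across the hexadecimal range, identifiers
-- beginning with '0'-'7' fall in the first batch and '8'-'f' in the second,
-- so the two row counts should be roughly equal.
SELECT CASE WHEN upid <= '8' THEN 'first_batch' ELSE 'second_batch' END AS batch,
       COUNT(*) AS row_count
FROM public.health_claims
GROUP BY 1;
```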

Run the following steps sequentially:

insert into public.health_claims_sorted select * from public.health_claims where upid <= '8';

insert into public.health_claims_sorted select * from public.health_claims where upid > '8';
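A common follow-up, not shown in the original steps, is to refresh planner statistics on the sorted table and drop the unsorted staging table to reclaim storage:

```sql
-- Update table statistics so the query planner has accurate row counts.
ANALYZE public.health_claims_sorted;

-- The unsorted staging table is no longer needed; dropping it frees storage.
DROP TABLE public.health_claims;
```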

You now have a table ready for your analytics users to consume!

Summary

By using Amazon Redshift Serverless and applying the sort key after the copy step, you can load these large electronic health datasets cost effectively in hours instead of days and without having to manage infrastructure.

AWS Expert
Published 1 year ago | 2,854 views