Master your data quality in Amazon Redshift

9 minute read
Content level: Intermediate

This article provides an overview of using Glue Data Quality with Amazon Redshift tables.

Introduction

Data warehouses store curated datasets for applications in analytics, machine learning (ML) and artificial intelligence (AI).

Technology developments like Zero-ETL and Redshift Streaming have made it easier than ever to ingest all of your data into Amazon Redshift, but before data can be integrated to generate insights or feed automated processes, it’s important to ensure the quality of data meets the needs of your use cases.

In this article we demonstrate how you can use AWS Glue Data Quality to create and apply data quality rules for datasets that reside in Amazon Redshift.

Scenario

Consider a fictitious event promotion business, AnyCompany, which sells bulk ticket packages for promoted events at a discounted rate. AnyCompany uses Amazon Redshift to store available listing packages, previous sales data, and demographic information relating to both buyers and sellers of ticket packages.

AnyCompany would like to use machine learning and automation to streamline the process of identifying high-propensity buyers and presenting them with the latest offers that match their profile. Implementing the automated offer process requires mitigating two key risks:

  1. Erroneous data being sent to their customers (e.g. wrong price, event name, customer details)
  2. Broken integration pipelines that prevent arrival of new information (e.g. new ticket listings or updated buyer contact details)

AnyCompany can use AWS Glue Data Quality to mitigate the above risks and enable their use case.

AWS Glue Data Quality

AWS Glue Data Quality allows you to measure and monitor the quality of your data so that you can make good business decisions. AWS Glue Data Quality works with Data Quality Definition Language (DQDL), a domain-specific language that you use to define data quality rules in either the AWS Glue Data Catalog or AWS Glue ETL.

Attributes of your data can be tested against multiple separate rules (e.g. completeness, data type), or rules can be combined using logical operators (e.g. length and uniqueness). Dynamic rules compare the current metrics produced by your rules with their historical values. This enables rules that, for example, flag anomalous batch row counts or confirm that the sum of a numerical field increases across batch cycles. To learn more about DQDL and supported rule types, see the Data Quality Definition Language (DQDL) reference.
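
For example, static and dynamic rules can be combined in a single DQDL ruleset. The sketch below shows such a ruleset as a Python string, which is the form in which an AWS Glue ETL job consumes it; the column names are illustrative only.

# A DQDL ruleset combining static rules (completeness, value checks) with a
# dynamic rule that compares the current row count with recent history.
# Column names are illustrative.
ruleset = """
    Rules = [
        IsComplete "eventid",
        ColumnValues "priceperticket" > 0,
        RowCount > avg(last(3)) * 0.8
    ]
"""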

This article implements data quality rules using AWS Glue ETL, which allows dynamic rules to be executed.

Prerequisites

To work through the example scenario in this article, please ensure you have:

  1. An Amazon Redshift endpoint (this may be either serverless or provisioned).
  2. An Amazon S3 bucket that can be used to stage data.
  3. The following VPC endpoints in the same VPC as your Redshift endpoint:
    • AWS Glue
    • Amazon Redshift
    • Amazon S3
    • Monitoring (Amazon CloudWatch)
  4. A default IAM role attached to your Redshift endpoint that can access the file located at s3://awssampledbuswest2/tickit/listings_pipe.txt
  5. A Redshift database user with the following permissions:
    • Create schema
    • Create table
  6. An IAM role that can be assumed by AWS Glue, with permissions to do the following:
    • Access Redshift
    • Access S3
    • Access CloudWatch
    • Access EventBridge

Architecture

The solution operates as follows:

  1. The data quality job is invoked by a schedule, an API call, or manual execution (a minimal invocation sketch follows this list)
  2. The required data is read from Amazon Redshift and staged in a temporary Amazon S3 location for scanning
  3. Results of the data quality scans are visible in the Glue ETL ‘Data Quality’ tab and are sent to Amazon CloudWatch, where automated alerts or customised actions can be taken based on data quality rule violations. Automated alerts and actions are beyond the scope of this article but are described in detail in the AWS Glue User Guide.
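
As a minimal sketch of step 1, the data quality job can be started programmatically with the AWS SDK for Python (boto3); the job name below is hypothetical, and the same call can be issued by EventBridge Scheduler, AWS Step Functions, or any upstream orchestrator.

import boto3

glue = boto3.client("glue")

# Start the data quality job on demand (job name is a placeholder).
run = glue.start_job_run(JobName="redshift-tickit-dq-job")
print(f"Started job run: {run['JobRunId']}")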


Implementation steps - Amazon Redshift

For the purpose of this article we will limit our data quality scans to a single entity, tickit_listing, which represents AnyCompany’s table of ticket packages available for promotion. The tickit_listing table is sourced from a publicly available dataset that can be loaded into your Amazon Redshift endpoint by executing the SQL statements below, which implement these steps:

  1. Create a new schema and table for the dataset
  2. COPY the data from a public S3 bucket
CREATE SCHEMA dq_testing;

CREATE TABLE dq_testing.tickit_listing (
    listid integer NOT NULL distkey,
    sellerid integer NOT NULL,
    eventid integer NOT NULL,
    dateid smallint NOT NULL,
    numtickets smallint NOT NULL,
    priceperticket numeric(8, 2),
    totalprice numeric(8, 2),
    listtime timestamp without time zone,
    PRIMARY KEY (listid)
) DISTSTYLE KEY
SORTKEY (dateid);

COPY dq_testing.tickit_listing
FROM 's3://awssampledbuswest2/tickit/listings_pipe.txt'
IAM_ROLE default
REGION 'us-west-2';

/*
Expected output:
    Load into table 'tickit_listing' completed, 192497 record(s) loaded successfully.
*/
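
If you prefer to verify the load programmatically, the sketch below uses the Amazon Redshift Data API via boto3; the workgroup and database names are assumptions for a serverless endpoint (use ClusterIdentifier and DbUser instead for a provisioned cluster).

import time
import boto3

client = boto3.client("redshift-data")

# Run a row count against the newly loaded table (names are placeholders).
resp = client.execute_statement(
    WorkgroupName="default-workgroup",
    Database="dev",
    Sql="SELECT COUNT(*) FROM dq_testing.tickit_listing;",
)

# Poll until the statement completes, then print the count (expected: 192497).
while client.describe_statement(Id=resp["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

result = client.get_statement_result(Id=resp["Id"])
print(result["Records"][0][0]["longValue"])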

Implementation steps - AWS Glue Data Catalog

To read the contents of the tickit_listing table in Amazon Redshift, a connection to the Redshift endpoint must be created in the AWS Glue Data Catalog. To do this:

  1. Navigate to Glue in the console, then select Connections from the menu bar

Navigate to connection in console.

  2. Select Create connection under the Connections section

Create connection

  3. Follow the prompts to create a Redshift connection

With your connection created, you may now create a data quality job using AWS Glue ETL.
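
If you would rather create the connection programmatically, the sketch below creates a JDBC connection with boto3; every value (connection name, JDBC URL, credentials, subnet, security group, Availability Zone) is a placeholder, and the console wizard above remains the simplest path.

import boto3

glue = boto3.client("glue")

# Create a JDBC connection to the Redshift endpoint (all values are placeholders).
glue.create_connection(
    ConnectionInput={
        "Name": "redshift-tickit-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://example-endpoint.us-west-2.redshift.amazonaws.com:5439/dev",
            "USERNAME": "awsuser",
            "PASSWORD": "REPLACE_ME",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-west-2a",
        },
    }
)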

Implementation steps - AWS Glue ETL

  1. Navigate to the Glue console > ETL jobs > Create job > Visual ETL


  2. Provide a job name, then add a new node of type Amazon Redshift Source


  3. Select the node and enter the required details. Use the Redshift connection created in the previous section of this article.
  4. Choose the tickit_listing table created earlier.
  5. Under Performance and security, select the S3 bucket to be used for temporary unloads (Prerequisite 2) and an IAM role with the appropriate permissions to unload data from Redshift (Prerequisite 6).
    • It is recommended that this S3 bucket has a storage lifecycle policy that removes data after a predefined period of time, which will reduce S3 storage costs.
  6. Add a new node of type Evaluate Data Quality


  7. Select the node and implement the rules below to validate that (a script-mode sketch of the equivalent job appears after these steps):
    • New listings contain at least 1 ticket: ColumnValues "numtickets" > 0
    • The total listing count does not fall below 80% of the average count captured from the trailing 3 executions: RowCount > avg(last(3)) * 0.80


  8. Save the job and run it. Navigate to the Runs tab to view job progress statistics.
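
For reference, the sketch below resembles the script that AWS Glue Studio generates for this visual job. It is a condensed, illustrative version only: the connection name, temporary S3 path, evaluation context, and option keys are assumptions and should be taken from the script generated for your own job.

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsgluedq.transforms import EvaluateDataQuality

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Node 1: read the tickit_listing table from Redshift, staging via S3.
# The connection name and temporary S3 path are placeholders.
redshift_source = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "redshift-tickit-connection",
        "dbtable": "dq_testing.tickit_listing",
        "redshiftTmpDir": "s3://my-dq-staging-bucket/temporary/",
    },
    transformation_ctx="redshift_source",
)

# Node 2: evaluate the two DQDL rules described above and publish the results.
ruleset = """
    Rules = [
        ColumnValues "numtickets" > 0,
        RowCount > avg(last(3)) * 0.80
    ]
"""

EvaluateDataQuality.apply(
    frame=redshift_source,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "tickit_listing_dq",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
    },
)

job.commit()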

Results

To inspect the results of the data quality run within AWS Glue ETL, navigate to the Data Quality tab.


The results confirm that both data quality rules have been satisfied during this execution. Results are sent to Amazon CloudWatch by default. Events can also be sent to Amazon EventBridge allowing for proactive alerting and actions to be taken if future executions detect rule violations.
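
As a sketch of how such alerting could be wired up, the boto3 example below creates an EventBridge rule that forwards Glue Data Quality events to an SNS topic. The event source and detail-type strings are assumptions to be confirmed against the AWS Glue User Guide, and the topic ARN is hypothetical.

import json
import boto3

events = boto3.client("events")

# Match Glue Data Quality result events (source/detail-type values are assumed;
# confirm them in the AWS Glue User Guide before relying on this rule).
events.put_rule(
    Name="glue-dq-result-alerts",
    EventPattern=json.dumps({
        "source": ["aws.glue-dataquality"],
        "detail-type": ["Data Quality Evaluation Results Available"],
    }),
    State="ENABLED",
)

# Forward matched events to a hypothetical SNS topic for notification.
events.put_targets(
    Rule="glue-dq-result-alerts",
    Targets=[{
        "Id": "notify-data-team",
        "Arn": "arn:aws:sns:us-west-2:123456789012:data-quality-alerts",
    }],
)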

Further enhancements & design considerations

Now that you have created data quality rules in Glue for your Redshift tables, there are several possible next steps, including:

  1. Write the results of Glue Data Quality jobs to S3 or Redshift. This would allow you to incorporate the results into subsequent workflows.
  2. Set up alerts and notifications using Amazon EventBridge to automatically notify relevant parties of data quality rule results.
  3. Incorporate data quality processes into your workflows so that downstream components are dependent on objects passing the data quality rules.
  4. Extend workflows to utilise the sensitive data detection (Detect PII) transform, an available node type in AWS Glue ETL. It allows you to identify and transform personally identifiable information (PII), so you can choose exactly what happens to PII data in your workflows.
  5. Implement referential integrity checks between multiple objects using Glue ETL.
  6. If using Glue ETL to pre-process data prior to ingestion into Amazon Redshift, consider adding data quality checks to those jobs to avoid creating and managing separate data quality jobs. Executing data quality checks as early as possible in your data pipelines helps limit the spread of poor-quality data through your platform and downstream applications.
  7. Leverage AWS Glue Flex execution to benefit from cost savings when running non-critical data quality assessments (see the sketch after this list).
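
Building on the earlier invocation sketch, requesting the Flex execution class is a one-parameter change to the start call (the job name remains hypothetical, and the job must be configured with a Glue version and worker type that support Flex).

import boto3

glue = boto3.client("glue")

# Run a non-urgent data quality assessment on lower-cost Flex capacity.
run = glue.start_job_run(
    JobName="redshift-tickit-dq-job",
    ExecutionClass="FLEX",
)
print(run["JobRunId"])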

If you would like to see any of these enhancements, please let us know in the comments.

Conclusion

Data quality is an important part of getting the most out of your data assets. Through this article you have learned how to apply static and dynamic data quality rules to tables in your data warehouse. This allows you to implement proactive alerting and take corrective action when data quality issues arise, without waiting for manual feedback from users.

Implementing data quality checks allows you to be confident in your data when using it for downstream applications, including analytics, ML, and AI-powered use cases.

Clean up

Note that because the AWS Glue components used in this article are serverless, no Glue costs are incurred while you are not executing the data quality workflows; storage costs in Amazon S3 and Amazon Redshift (and compute costs for a provisioned Redshift cluster, if used) still apply. To clean up your environment, do the following.

  • Delete the data in your S3 bucket used for staging data
  • Delete the created Glue ETL job
  • Delete the created Glue connection
  • Run the following SQL statement on your Redshift endpoint: DROP SCHEMA dq_testing CASCADE; (a scripted version of these clean-up steps is sketched below)
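
For those who prefer to script the clean-up, the sketch below covers the same steps with boto3 and the Redshift Data API; the bucket, prefix, job, connection, and workgroup names are the placeholders used earlier in this article.

import boto3

s3 = boto3.resource("s3")
glue = boto3.client("glue")
redshift_data = boto3.client("redshift-data")

# 1. Remove staged data from the temporary S3 prefix.
s3.Bucket("my-dq-staging-bucket").objects.filter(Prefix="temporary/").delete()

# 2. Delete the Glue ETL job and the Glue connection.
glue.delete_job(JobName="redshift-tickit-dq-job")
glue.delete_connection(ConnectionName="redshift-tickit-connection")

# 3. Drop the test schema in Redshift (serverless shown; use
#    ClusterIdentifier and DbUser for a provisioned cluster).
redshift_data.execute_statement(
    WorkgroupName="default-workgroup",
    Database="dev",
    Sql="DROP SCHEMA dq_testing CASCADE;",
)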

Authors

Rick Fraser - Specialist Data & AI Solutions Architect

Sean Beath - Specialist Redshift Solutions Architect
