AWS Glue Dynamic Frame – JDBC Performance Tuning Configuration
This document lists the options for improving JDBC source query performance from an AWS Glue dynamic frame by adding configuration parameters to the 'from_catalog' method of the 'create_dynamic_frame' function. Four different options are discussed, as given below:
- Parallel JDBC Query Execution (Single Table)
- Predicate Push Down (Single Table)
- Query Push Down (Multi Table)
- Glue Studio - Custom Connector
1. Parallel JDBC Query Execution
By default, JDBC reads use a single connection to fetch data from the source, but the number of parallel executions can be changed by using the 'hashfield' and 'hashpartitions' parameters within the 'from_catalog' method, as given below.
‘hashfield’: Set hashfield to the name of a column in the JDBC table to be used to divide the data into partitions. For best results, this column should have an even distribution of values to spread the data between partitions. This column can be of any data type.
'hashpartitions': Set hashpartitions to the number of parallel reads of the JDBC table. If this property is not set, the default value is 7.
Notes:
- 'hashpartitions' won't work with Oracle databases.
- 'hashexpression' can be used instead of 'hashfield'.
Code Snippet:
JDBC_DF = glueContext.create_dynamic_frame.from_catalog(
    database="dms",
    table_name="dms_large_dbo_person",
    transformation_ctx="JDBC_DF",
    additional_options={
        "hashfield": "last_name",
        "hashpartitions": "10",
    },
)
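Conceptually, the read above fans out into one JDBC query per partition, each with a hash-based filter on the chosen column. The sketch below illustrates the idea in plain Python; the MOD/HASH predicate shape is an assumption for illustration, not the exact SQL that Glue generates:

```python
def build_partition_queries(table, hashfield, hashpartitions):
    """Sketch: one SELECT per partition; each query keeps only the rows
    whose hashed column value falls into that partition's bucket."""
    return [
        f"SELECT * FROM {table} "
        f"WHERE MOD(HASH({hashfield}), {hashpartitions}) = {i}"
        for i in range(hashpartitions)
    ]

# Mirrors the snippet above: 'last_name' as hashfield, 10 partitions,
# so 10 queries run in parallel instead of one serial scan.
queries = build_partition_queries("dms_large_dbo_person", "last_name", 10)
```

This is why an evenly distributed column matters: if most rows hash to the same bucket, one of these queries does almost all the work.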
2. Predicate Push Down
The parallel execution option explained above increases the number of parallel reads, but if users want to filter the data at the source, the 'push down predicate' parameters given below are the option to use.
‘hashexpression’: AWS Glue generates SQL queries to read the JDBC data in parallel using the hashexpression in the WHERE clause to partition data.
'enablePartitioningForSampleQuery': By default this option is false. Required if you want to use sampleQuery with a partitioned JDBC table. If set to true, sampleQuery must end with "where" or "and" for AWS Glue to append partitioning conditions.
'sampleQuery': The custom SQL query statement for sampling. By default the sample query is executed by a single executor. If you're reading a large dataset, you may need to enable JDBC partitioning to query a table in parallel.
Code Snippet:
JDBC_DF_PDP = glueContext.create_dynamic_frame.from_catalog(
    database="dms",
    table_name="dms_large_dbo_person",
    transformation_ctx="JDBC_DF_PDP",
    additional_options={
        "hashexpression": "id",
        "enablePartitioningForSampleQuery": True,
        "sampleQuery": "select * from person where last_name <> 'rb' and",
    },
)
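The reason sampleQuery must end with "where" or "and" is that Glue appends its partitioning predicate directly after the user-supplied query text. A minimal sketch of that composition (the MOD/HASH predicate is an illustrative assumption, not Glue's exact generated SQL):

```python
def apply_sample_query(sample_query, hashexpression, hashpartitions, partition):
    """Sketch: the partitioning condition is appended verbatim after the
    user's query, so the query must already end with 'where' or 'and'."""
    return (
        f"{sample_query} "
        f"MOD(HASH({hashexpression}), {hashpartitions}) = {partition}"
    )

# Mirrors the snippet above, for partition 0 of the default 7 partitions.
q = apply_sample_query(
    "select * from person where last_name <> 'rb' and", "id", 7, 0
)
```

If the trailing "and" were missing, the appended condition would produce a syntactically invalid statement at the source.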
Screenshot 3: Spark UI with 7 executors.
The default partition/executor count of 7 is used; users can modify it by adding the 'hashpartitions' parameter to the dynamic frame script.
3. Query Push Down
Queries with multiple tables can be executed with an approach similar to the second option, as given below.
'hashexpression': AWS Glue generates SQL queries to read the JDBC data in parallel using the hashexpression in the WHERE clause to partition data. In the case of queries with multiple tables, assign a numeric attribute with a distinct, qualified identifier (table_name.column_name, as given below).
Code Snippet:
JDBC_DF_QUERY = glueContext.create_dynamic_frame.from_catalog(
    database="dms_large",
    table_name="dms_large_dbo_sporting_event",
    transformation_ctx="JDBC_DF_QUERY",
    additional_options={
        "hashpartitions": "20",
        "hashfield": "pr.id",
        "hashexpression": "pr.id",
        "enablePartitioningForSampleQuery": True,
        "sampleQuery": "select pr.id, fl.full_name from dms_large.dbo.person pr "
                       "inner join dms_large.dbo.person_full fl on pr.id = fl.id and",
    },
)
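In a multi-table query, the hash column is written into the appended predicate exactly as given, so it must be qualified with the table alias to avoid an ambiguous-column error at the source. A sketch of the composed statement (the MOD/HASH predicate shape is an illustrative assumption):

```python
def partition_predicate(qualified_column, hashpartitions, partition):
    """Sketch: the column name is used verbatim, so in a join it must be
    qualified (alias.column) to be unambiguous."""
    return f"MOD(HASH({qualified_column}), {hashpartitions}) = {partition}"

# Mirrors the snippet above: the join query ends with 'and', and the
# qualified column 'pr.id' keeps the appended predicate unambiguous.
join_query = (
    "select pr.id, fl.full_name from dms_large.dbo.person pr "
    "inner join dms_large.dbo.person_full fl on pr.id = fl.id and"
)
final_query = f"{join_query} {partition_predicate('pr.id', 20, 5)}"
```

With an unqualified column name such as plain 'id', the source database could not tell whether the predicate refers to pr.id or fl.id.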
Note: Please refer to the AWS documentation below for details:
https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html
4. Glue Studio - Custom Connector
A custom connector based option can be used to leverage a SQL query at the source.
https://docs.aws.amazon.com/glue/latest/ug/connectors-chapter.html
https://aws.amazon.com/blogs/big-data/developing-testing-and-deploying-custom-connectors-for-your-data-stores-with-aws-glue/