Glue ETL: cross-region write to Iceberg failing


I have a Glue ETL job that runs in one region (eu-central-1) and successfully reads source data from an S3 bucket in a different region (us-east-1). I would like to write the output of the transformations I am performing to an S3 location in the same region as the source (us-east-1).

This works for the following formats:

  • Parquet
  • Hudi
  • Delta

It does not work for Iceberg, which is the format I require.

Here is my code (created through the visual editor):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1711553291421 = glueContext.create_dynamic_frame.from_catalog(database="raw", table_name="us_source", transformation_ctx="AWSGlueDataCatalog_node1711553291421")

# Script generated for node Amazon S3
additional_options = {"write.parquet.compression-codec": "gzip"}
tables_collection = spark.catalog.listTables("bronze")
table_names_in_db = [table.name for table in tables_collection]
table_exists = "sandbox_us_iceberg" in table_names_in_db
if table_exists:
    # Table already exists in the catalog: append to it
    AmazonS3_node1711553556115_df = AWSGlueDataCatalog_node1711553291421.toDF()
    AmazonS3_node1711553556115_df.writeTo("glue_catalog.bronze.sandbox_us_iceberg") \
        .tableProperty("format-version", "2") \
        .tableProperty("location", "s3://datalake-bronze/sensitive_data/sandbox_us/bronze/sandbox_us_iceberg") \
        .options(**additional_options) \
        .append()
else:
    # Table does not exist yet: create it at the cross-region S3 location
    AmazonS3_node1711553556115_df = AWSGlueDataCatalog_node1711553291421.toDF()
    AmazonS3_node1711553556115_df.writeTo("glue_catalog.bronze.sandbox_us_iceberg") \
        .tableProperty("format-version", "2") \
        .tableProperty("location", "s3://datalake-bronze/sensitive_data/sandbox_us/bronze/sandbox_us_iceberg") \
        .options(**additional_options) \
        .create()

job.commit()
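
For reference, Iceberg support is enabled on the job through the standard Glue job parameters, roughly like the following (the warehouse path here is illustrative, not my real one; in Glue, the spark.sql.* settings are chained together into the value of a single --conf job parameter):

--datalake-formats iceberg
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.warehouse=s3://datalake-bronze/warehouse/
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO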

The error message I receive is: "Error Category: UNCLASSIFIED_ERROR; Failed Line Number: 36; An error occurred while calling o143.create. Writing job aborted. Note: This run was executed with Flex execution. Check the logs if run failed due to executor termination." The failing call, o143.create, is the line that creates the Iceberg table.

The bottom of the stack trace provided looks like this:

	Suppressed: software.amazon.awssdk.services.s3.model.S3Exception: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint. (Service: S3, Status Code: 301, Request ID: HT0GMV7SHAZS65SA, Extended Request ID: pwBirv0jZiIDZCakLOufsj4M/dySVQHm8Vbl5zHu4R136Kw+c0EAvGYfiUSKBlDIbeWXq2SatSA=)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85)
		at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43)
		at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:95)
		at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:245)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:30)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:73)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:50)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:36)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
		at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:48)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:31)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
		at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
		at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:193)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:167)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:175)
		at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76)
		at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
		at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56)
		at software.amazon.awssdk.services.s3.DefaultS3Client.deleteObject(DefaultS3Client.java:2532)
		at org.apache.iceberg.aws.s3.S3FileIO.deleteFile(S3FileIO.java:153)
		at org.apache.iceberg.spark.source.SparkWrite.lambda$deleteFiles$2(SparkWrite.java:681)
		at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:402)
		at org.apache.iceberg.util.Tasks$Builder.access$300(Tasks.java:68)
		at org.apache.iceberg.util.Tasks$Builder$1.run(Tasks.java:308)
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		... 3 more
Caused by: java.io.IOException: Unable to read response payload, because service returned response code 301 to an Expect: 100-continue request. Using another HTTP client implementation (e.g. Apache) removes this limitation.
	... 64 more
Caused by: java.io.UncheckedIOException: java.net.ProtocolException: Server rejected operation
	at software.amazon.awssdk.utils.FunctionalUtils.asRuntimeException(FunctionalUtils.java:180)
	at software.amazon.awssdk.utils.FunctionalUtils.lambda$safeSupplier$4(FunctionalUtils.java:110)
	at software.amazon.awssdk.utils.FunctionalUtils.invokeSafely(FunctionalUtils.java:136)
	at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.lambda$tryGetInputStream$1(UrlConnectionHttpClient.java:315)
	at software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient$RequestCallable.getAndHandle100Bug(UrlConnectionHttpClient.java:345)
	... 63 more
Caused by: java.net.ProtocolException: Server rejected operation
	at sun.net.www.protocol.http.HttpURLConnection.expect100Continue(HttpURLConnection.java:1277)
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1356)
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1317)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1526)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
	at software.amazon.awssdk.utils.FunctionalUtils.lambda$safeSupplier$4(FunctionalUtils.java:108)
	... 66 more

So it looks like some sort of permission error, but I am unsure how to fix it because, as I mentioned, the other file formats write fine to this same S3 location. Could it be an issue with the library used to write the Iceberg table, given that the trace says using another HTTP client implementation (e.g. Apache) removes this limitation?
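
For what it's worth, the Iceberg AWS docs suggest the S3 HTTP client can be switched through a catalog property. I have not tried this, but assuming my catalog name glue_catalog, it would presumably be another job parameter along the lines of:

--conf spark.sql.catalog.glue_catalog.http-client.type=apache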

Any help in solving this issue would be appreciated. We have a hard requirement to use Iceberg as the format so we can update records through Athena SQL, and we also aim to run all our ETL jobs in eu-central-1.

Vince
asked a month ago · 129 views
1 Answer
Accepted Answer

That is because Iceberg uses S3 directly for some operations, instead of going through Hadoop as the other formats do.
Try setting client.assume-role.region (even though you are not assuming a role, according to the docs it should still work); see: https://iceberg.apache.org/docs/latest/aws/#cross-account-and-cross-region-access
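
For example, assuming the catalog name glue_catalog from the question, the property would be appended to the job's --conf parameter with the source bucket's region:

--conf spark.sql.catalog.glue_catalog.client.assume-role.region=us-east-1

Catalog properties are read when the catalog is initialized, so setting this in the job parameters (rather than mid-script) is the safe route.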

AWS EXPERT
answered 25 days ago
reviewed 16 days ago by AWS SUPPORT ENGINEER
  • You are a wizard! Thank you!
