I want to access an Amazon Simple Storage Service (Amazon S3) Requester Pays bucket from AWS Glue, Amazon EMR, or Amazon Athena.
Resolution
To access an S3 bucket that has Requester Pays turned on, every request to the bucket must include the Requester Pays header.
AWS Glue
AWS Glue requests to Amazon S3 don't include the Requester Pays header by default. Without the Requester Pays header, an API call to a Requester Pays bucket fails with an AccessDenied error.
To add the Requester Pays header to an ETL script, use hadoopConfiguration().set() to set fs.s3.useRequesterPaysHeader to true on the GlueContext variable or the Apache Spark session variable.
GlueContext:
glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
Spark session:
spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
The following is an example ETL script that includes the header:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Add the Requester Pays header to all Amazon S3 requests from Spark
spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
# Or set the header on the GlueContext instead:
# glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
## AWS Glue DynamicFrame read and write
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "your_database_name", table_name = "your_table_name", transformation_ctx = "datasource0")
datasource0.show()
datasink = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://awsdoc-example-bucket/path-to-target-location/"}, format = "csv")
## Spark DataFrame read and write
df = spark.read.csv("s3://awsdoc-example-bucket/path-to-source-location/")
df.show()
df.write.csv("s3://awsdoc-example-bucket/path-to-target-location/")
job.commit()
Note: In the preceding script, replace the following values with your values:
- your_database_name with the name of your database
- your_table_name with the name of your table
- s3://awsdoc-example-bucket/path-to-source-location/ with the path to the source bucket
- s3://awsdoc-example-bucket/path-to-target-location/ with the path to the destination bucket
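Note that fs.s3.useRequesterPaysHeader applies only to reads and writes that go through Spark or AWS Glue. If your job also calls Amazon S3 directly through the AWS SDK for Python (Boto3), pass RequestPayer="requester" on each call. The following is a minimal sketch; the bucket name and object key are placeholders:
import boto3

s3 = boto3.client("s3")
# Direct SDK calls must opt in to Requester Pays separately from
# the fs.s3.useRequesterPaysHeader setting that Spark and AWS Glue use.
response = s3.get_object(
    Bucket="awsdoc-example-bucket",          # placeholder bucket
    Key="path-to-source-location/data.csv",  # placeholder key
    RequestPayer="requester",
)
print(response["ContentLength"])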
Amazon EMR
To add fs.s3.useRequesterPaysHeader for Amazon EMR, set the following property in /usr/share/aws/emr/emrfs/conf/emrfs-site.xml:
<property>
    <name>fs.s3.useRequesterPaysHeader</name>
    <value>true</value>
</property>
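Editing emrfs-site.xml takes effect only on the node where you edit it and doesn't persist to new clusters. As an alternative, you can set the property on all nodes at cluster launch through the emrfs-site configuration classification. The following is a minimal sketch with the AWS SDK for Python (Boto3); the cluster name, release label, instance settings, and IAM roles are placeholders:
import boto3

emr = boto3.client("emr")
# Launch a cluster with the emrfs-site classification so that every
# node starts with fs.s3.useRequesterPaysHeader set to true.
response = emr.run_job_flow(
    Name="requester-pays-cluster",  # placeholder name
    ReleaseLabel="emr-6.15.0",      # placeholder release
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[
        {
            "Classification": "emrfs-site",
            "Properties": {"fs.s3.useRequesterPaysHeader": "true"},
        }
    ],
    ServiceRole="EMR_DefaultRole",      # placeholder role
    JobFlowRole="EMR_EC2_DefaultRole",  # placeholder role
)
print(response["JobFlowId"])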
Athena
To allow workgroup members to query Requester Pays buckets, complete the following steps:
- Open the Athena console.
- In the navigation pane, choose Workgroups.
- Select your workgroup, and then choose Edit.
- In Settings, choose Turn on queries on requester pays buckets in Amazon S3. For more information, see Edit a workgroup.
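You can also turn on the setting programmatically. The following is a minimal sketch with the AWS SDK for Python (Boto3), assuming the workgroup is named primary:
import boto3

athena = boto3.client("athena")
# Allow queries in the workgroup to read from Requester Pays buckets.
athena.update_work_group(
    WorkGroup="primary",  # placeholder workgroup name
    ConfigurationUpdates={"RequesterPaysEnabled": True},
)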
Related information
Configuring Requester Pays on a bucket
Downloading objects from Requester Pays buckets
How do I troubleshoot 403 Access Denied errors from Amazon S3?