Questions tagged with AWS Glue
Each supplier provides me with their product data in their own format, so I need to map them in order to import the data into Magento.
As the product data needs to be updated frequently, I'm wondering whether AWS Glue DataBrew would be a suitable service for mapping these different Excel files into a single unified Magento format. I would configure it once, and whenever I receive an updated file, I could simply run the Glue job again.
Would you recommend Glue DataBrew for this, or is it intended only for data-analysis preparation?
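If DataBrew is used, a recipe job can be configured once and then re-run programmatically each time a new supplier file arrives. A hedged sketch (the job name is a placeholder) of triggering such a run with boto3, for example from an S3-triggered Lambda function:
```
import boto3

databrew = boto3.client("databrew")

def handler(event, context):
    # Hedged sketch: start an existing, pre-configured DataBrew recipe job
    # (placeholder name) whenever a new supplier file lands in S3.
    response = databrew.start_job_run(Name="supplier-to-magento-mapping-job")
    return response["RunId"]
```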
Hi, I am trying to create a Glue database and grant permissions on it in Lake Formation. I have tried several approaches with different IAM roles and policies based on the documentation, but every time I get `Insufficient Lake Formation permission(s): Required Create Database on Catalog`. The code is fairly straightforward, and when I define permissions in Lake Formation I specify the `ALL` value. Does anyone have an idea what is missing in the code?
```
import aws_cdk as cdk
from aws_cdk import Stack
from constructs import Construct


class ExampleStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        custom_data_bucket_arn = "arn:aws:s3:::transformed-v5"
        s3_location = "s3://transformed-v5"
        bucket_name = "transformed-v5"

        glue_role = cdk.aws_iam.Role(
            self, "glue_role",
            assumed_by=cdk.aws_iam.ServicePrincipal("glue.amazonaws.com"),
            managed_policies=[
                cdk.aws_iam.ManagedPolicy.from_managed_policy_arn(
                    self, "MyCrawlerGlueRole",
                    managed_policy_arn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
                )
            ]
        )
        glue_role.add_to_policy(cdk.aws_iam.PolicyStatement(
            actions=["s3:*", "lakeformation:GetDataAccess", "lakeformation:GrantPermissions"],
            effect=cdk.aws_iam.Effect.ALLOW,
            resources=["*"]
        ))
        glue_role.add_to_policy(cdk.aws_iam.PolicyStatement(
            actions=["iam:PassRole"],
            effect=cdk.aws_iam.Effect.ALLOW,
            resources=["arn:aws:iam::xxxxxxx:role/{}".format(glue_role.role_name)]
        ))

        glue_db = cdk.aws_glue.CfnDatabase(
            self, "MyDatabase",
            catalog_id=cdk.Aws.ACCOUNT_ID,
            database_input=cdk.aws_glue.CfnDatabase.DatabaseInputProperty(
                name="datalake-v5",
                location_uri=s3_location
            )
        )

        cdk.aws_glue.CfnCrawler(
            self, "datalake-crawler",
            database_name="datalake-v5",
            role=glue_role.role_arn,
            schedule={"scheduleExpression": "cron(0/15 * * * ? *)"},
            targets={"s3Targets": [{"path": bucket_name}]},
        )

        location_resource = cdk.aws_lakeformation.CfnResource(
            self, "MyDatalakeLocationResource",
            resource_arn=custom_data_bucket_arn,
            use_service_linked_role=True
        )

        cdk.aws_lakeformation.CfnPermissions(
            self, "MyDatalakeDatabasePermission",
            data_lake_principal=cdk.aws_lakeformation.CfnPermissions.DataLakePrincipalProperty(
                data_lake_principal_identifier=glue_role.role_arn
            ),
            resource=cdk.aws_lakeformation.CfnPermissions.ResourceProperty(
                database_resource=cdk.aws_lakeformation.CfnPermissions.DatabaseResourceProperty(
                    name="datalake-v5"
                )
            ),
            permissions=["ALL"],
            permissions_with_grant_option=["ALL"]
        )

        location_permission = cdk.aws_lakeformation.CfnPermissions(
            self, "MyDatalakeLocationPermission",
            data_lake_principal=cdk.aws_lakeformation.CfnPermissions.DataLakePrincipalProperty(
                data_lake_principal_identifier=glue_role.role_arn
            ),
            resource=cdk.aws_lakeformation.CfnPermissions.ResourceProperty(
                data_location_resource=cdk.aws_lakeformation.CfnPermissions.DataLocationResourceProperty(
                    s3_resource=custom_data_bucket_arn
                )
            ),
            permissions=["DATA_LOCATION_ACCESS"]
        )
        location_permission.add_dependency(location_resource)
```
Thank you!
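One common cause of this particular error is that the `AWS::Glue::Database` resource is created by the principal that executes the CloudFormation deployment (not by `glue_role`), so that principal also needs Lake Formation permission to create databases in the catalog. A minimal, hedged sketch of one way to express this inside the same stack, registering the deployment role (placeholder ARN) as a Lake Formation administrator via `CfnDataLakeSettings`:
```
# Hedged sketch: the Glue database is created by the role that runs the
# CloudFormation deployment, so that role may also need Lake Formation
# permission to create databases in the catalog. The ARN below is a placeholder
# for the actual deployment/execution role.
deployment_role_arn = "arn:aws:iam::xxxxxxx:role/my-cdk-deploy-role"

cdk.aws_lakeformation.CfnDataLakeSettings(
    self, "MyDataLakeSettings",
    admins=[
        cdk.aws_lakeformation.CfnDataLakeSettings.DataLakePrincipalProperty(
            data_lake_principal_identifier=deployment_role_arn
        )
    ]
)
```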
Need AWS tech team's help here.
I've used my job's temporary path, retrieved by the `getResolvedOptions` function, as the `staging_path` of the `relationalize` function.
I found that the job fails sometimes (i.e. **not regularly**) because it can't retrieve the staged table after the `relationalize` function has executed.
For better understanding, I've added some explanation and code below.
Please advise if you have any ideas, and kindly confirm whether we can keep using arguments retrieved by the `getResolvedOptions` function.
### [Code 1]
```
args = getResolvedOptions(sys.argv, [..., "TempDir", ...])
...
# the name of the target field to be relationalized is "params"
flatten_dyc = dyc["post_log"].relationalize(
    root_table_name='root',
    staging_path=args["TempDir"],
    transformation_ctx='flatten_dyc'
)
flatten_dyc["root"].printSchema()
flatten_dyc["root_params"].printSchema()
```
This morning, I ran it and got the result below.

`flatten_dyc["root_params"]` is empty, even though it should at least have had an `id` field to join with the `flatten_dyc["root"]` table.
### [Code 2]
So I tried the same script with a hard-coded `staging_path` (please refer to the code below) and found that the job read the staged tables - `flatten_dyc["root"]` - with all fields successfully.
```
...
flatten_dyc = dyc["post_log"].relationalize(
    root_table_name='root',
    staging_path="s3://temp-glue-info/",
    transformation_ctx='flatten_dyc'
)
flatten_dyc["root"].printSchema()
flatten_dyc["root_params"].printSchema()
```

### My questions are:
1. Why couldn't the function read the staged table properly when the path was soft-coded?
2. **Moreover**, when I ran [Code 1] again, `flatten_dyc["root_params"]` was read successfully, which means the function is not reliable. Can you look into this?
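One workaround that may be worth trying (not a confirmed root cause) is to make the staging path unique per run, so that stale files left in the shared `TempDir` by a previous run cannot interfere with `relationalize`. A minimal sketch, assuming the same `args` and `dyc` as in [Code 1]:
```
import uuid

# Hedged sketch: stage relationalize output under a per-run prefix inside the
# job's TempDir so each run starts from an empty staging location.
staging_path = args["TempDir"].rstrip("/") + "/relationalize-" + uuid.uuid4().hex

flatten_dyc = dyc["post_log"].relationalize(
    root_table_name='root',
    staging_path=staging_path,
    transformation_ctx='flatten_dyc'
)
```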
I am trying to create a Glue job by executing a CloudFormation template. Below are the IAM policy details:
- test_glue_role : (AmazonS3FullAccess,AWSCloudFormationFullAccess)
- test_cloudformation_role : (AWSGlueConsoleFullAccess,AmazonS3FullAccess)
The template.yml file is uploaded to an S3 bucket.
When I try to create the CloudFormation stack, it throws the below error every time:
Error:
User: arn:aws:sts::300800030007:assumed-role/test_cloudformation_role/AWSCloudFormation is not authorized to perform: iam:PassRole on resource: arn:aws:iam::300800030007:role/test_glue_role because no identity-based policy allows the iam:PassRole action (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 883499f2-71ce-4c71-a6d4-6bfe49f23dc3; Proxy: null)
How can I resolve this issue?
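The error says that `test_cloudformation_role` is not allowed to pass `test_glue_role` to the Glue service, and the managed policies listed above do not appear to grant `iam:PassRole` on that specific role. A hedged sketch of one way to add the missing permission as an inline policy via boto3 (the account ID is taken from the error message; adjust names to your environment):
```
import json
import boto3

iam = boto3.client("iam")

# Hedged sketch: attach an inline policy to test_cloudformation_role that allows
# passing test_glue_role to the Glue service (account ID from the error message).
pass_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::300800030007:role/test_glue_role",
            "Condition": {"StringEquals": {"iam:PassedToService": "glue.amazonaws.com"}},
        }
    ],
}

iam.put_role_policy(
    RoleName="test_cloudformation_role",
    PolicyName="AllowPassGlueJobRole",
    PolicyDocument=json.dumps(pass_role_policy),
)
```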
Hi,
I'm running my AWS Glue PySpark job locally in Docker, based on this description - https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container.
I'm getting the following exception when I try to display data from a Glue DynamicFrame with `gdf.show(5)`:
23/02/06 11:20:09 WARN SimdCsvParser$: Could not find AWSGlueSimdNative lib from /home/glue_user/native_libs/amd64/hadoop-lzo-native:/home/glue_user/native_libs/amd64/hadoop-native/:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
java.lang.UnsatisfiedLinkError: no AWSGlueSimdNative in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
at java.lang.Runtime.loadLibrary0(Runtime.java:843)
at java.lang.System.loadLibrary(System.java:1136)
at com.amazonaws.services.glue.readers.csv.SimdCsvParser$.loadNativeLib(SimdCsvParser.scala:366)
at com.amazonaws.services.glue.readers.csv.SimdCsvParser$.isAllowed(SimdCsvParser.scala:361)
at com.amazonaws.services.glue.readers.CSVReader.init(CSVReader.scala:258)
at com.amazonaws.services.glue.hadoop.TapeHadoopColumnarReader.initialize(TapeHadoopRecordReader.scala:272)
Do you know if some native library is missing from the amazon/aws-glue-libs:glue_libs_4.0.0_image_01 image?
Regards,
I am trying to use a UDF to convert a column to uppercase.
The UDF is in a .zip file in an S3 bucket, which is accessed by the Glue job.
The Python script of the UDF is attached for reference.
ISSUE:
The UDF is not storing the data in the expected S3 bucket, nor is it reading the data from the S3 bucket provided.
```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

# Create a Spark context
sc = SparkContext()

# Create a Glue context
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Import the UDF code from Amazon S3
spark.sparkContext.addPyFile("s3://**********")
from mylibrary import udf_uppercase

# Define the AWS Glue job
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read data from S3
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://***********"]},
    format="json")

# Apply the UDF to the data (convert to a DataFrame first)
df = datasource0.toDF()
transformed_df = df.withColumn("column_name", udf_uppercase(df["column_name"]))
transformed_df.show()

# Write the transformed data back to S3 (convert back to a DynamicFrame)
transformed_dyf = DynamicFrame.fromDF(transformed_df, glueContext, "transformed_dyf")
glueContext.write_dynamic_frame.from_options(
    frame=transformed_dyf,
    connection_type="s3",
    connection_options={"path": "s3://*****************"},
    format="json",
    transformation_ctx="test_uppercase_bookmark"
)

# Run the AWS Glue job
job.commit()
```
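For reference, the attached UDF script is not shown in the question; a module named `mylibrary` would typically define `udf_uppercase` roughly as in this hedged sketch (the real attached script may differ):
```
# mylibrary.py -- illustrative sketch only; the actual attached script may differ.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Uppercase a string column value, passing nulls through unchanged.
udf_uppercase = udf(lambda value: value.upper() if value is not None else None, StringType())
```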
Hi, I just wanted to know whether AWS Glue bookmark columns can have null values. If null values are not supported, can anyone point me to the relevant AWS documentation?
Thank you!
I am trying to create a process that will ingest .csv files dropped into an S3 bucket daily and add those records to an existing table in a serverless Redshift database. To accomplish this, I'm trying to build a Glue job that will scan the S3 bucket for new files and update/add those records to the Redshift table accordingly. I have no Spark or Python skills, so I am using Glue Studio. I have selected the 'Visual with a source and target' option, with my source being Amazon S3 and the target Amazon Redshift. I click 'Create' and am taken to the canvas. I can find my S3 bucket and apply the transform mappings, but when I click the Redshift Cluster node, I do not see my serverless Redshift database. I understand there is a difference between a serverless Redshift database and a Redshift cluster, but there is no option to select Serverless Redshift as the target. My question seems to be similar to the one posted here, but the answer provided did not really help me: https://repost.aws/questions/QU33GP1YziR5OXIn-vehIxwA/aws-glue-studio-source-s-3-target-redshift-job-wants-to-select-gdc-not-redshift-table-in-target
I followed the steps outlined here https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html (though this specifically applies to Redshift Clusters) to address the Inbound/outbound rules of the VPC security group in my Serverless Redshift workgroup and all look to be set up correctly. I also reviewed the associated IAM roles and they also look to be set up correctly.
My question is how do I set up a Glue job to read from an S3 bucket and populate a table in a Serverless Redshift database? What steps am I missing? I can't tell if I'm running into permission issues (i.e. Redshift won't allow Glue access) or if there is something else I'm missing. I cannot find any documentation that specifically addressed Glue and Serverless Redshift databases. I am open to any suggestions.
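As a possible interim approach while the Studio target node only lists provisioned clusters, a Glue job script can write to a Redshift Serverless workgroup through a Glue JDBC connection whose endpoint points at the workgroup. A hedged sketch with placeholder connection, database, table, and bucket names:
```
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
sc = SparkContext()
glueContext = GlueContext(sc)

# Read the daily CSV drops from S3 (bucket path is a placeholder).
source_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-input-bucket/daily/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Load into Redshift Serverless through a Glue connection whose JDBC URL points
# at the workgroup endpoint ("redshift-serverless-conn", "dev" and the table
# name are placeholders). The S3 staging directory is required for the load.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=source_dyf,
    catalog_connection="redshift-serverless-conn",
    connection_options={"dbtable": "public.my_target_table", "database": "dev"},
    redshift_tmp_dir=args["TempDir"],
)
```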
An AWS Honeycode API call from AWS Glue Python code using boto3 is failing.
Sharing the code snippet below:
```
import boto3
honeycode_client = boto3.client('honeycode')
response = honeycode_client.list_tables(workbookId = 'eb59751e-ef06-4de0-a4a9-c355e49308ab')
```
NOTE: The workbook and tables are already created in Honeycode. The role attached to the Glue job has the below policies:
AmazonS3FullAccess
```
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:*",
"s3-object-lambda:*"
],
"Resource": "*"
}
]
}
```
AmazonHoneycodeFullAccess
```
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"honeycode:*"
],
"Resource": "*",
"Effect": "Allow"
}
]
}
```
The above code snippet gives the below error:
**AccessDeniedException: An error occurred (AccessDeniedException) when calling the ListTables operation: Team is not associated to an AWS Account**
What extra access is required to resolve the above access issue?
Hello,
I am currently using ServiceNow table dumps exported as JSON array files to S3.
The S3 folder path structure is similar to the following:
```
aws s3 ls s3://bucket/data/
PRE data1-2022-09-22/
PRE data2-2022-09-22/
PRE data3-2022-09-22/
```
Each folder contains a series of JSON files in their own schema.
These files each contain an array of nested JSON objects, in the following structure:
```
[
{
"obj1": {
"display_value": "Change",
"value": "Change"
},
"obj2": {
"display_value": "Cancel all future Tasks",
"value": "cancel"
},
"obj3": {
"display_value": "2021-01-21 15:00:01",
"value": "2021-01-21 15:00:01"
}
},
{
"obj1": {
"display_value": "Change",
"value": "Change"
},
"obj2": {
"display_value": "Cancel all future Tasks",
"value": "cancel"
},
"obj3": {
"display_value": "2021-01-07 20:36:34",
"value": "2021-01-07 20:36:34"
}
},
{
"obj1": {
"display_value": null,
"value": ""
},
"obj2": {
"display_value": "Cancel all future Tasks",
"value": "cancel"
},
"obj3": {
"display_value": "2021-02-11 19:40:40",
"value": "2021-02-11 19:40:40"
}
}
]
```
As shown in the sample data provided, some of the values may be empty or null, but overall the structure is the same.
When using a custom Glue JSON classifier, I am able to split the nested JSON objects into individual column names, where each column's data type is inferred as a struct, i.e.:
```
{
"obj3": {
"display_value": "string",
"value": "string"
}
}
```
Without using a custom Glue JSON classifier, the schema is inferred as a single column (named array), whose data type is an array containing all of the nested objects with elements and their types.
I note that using a custom Glue JSON classifier in this approach may not actually be the best way; instead, it may be preferable to use no custom classifier and then UNNEST the data from the array structure using an Athena query [1], using a CTAS statement to load it to S3.
Currently, I am seeking a way to unnest the JSON objects such that (for the example above) the output would show all of the nested values in individual rows (additional columns and values from the sample data not shown):
```
obj1, obj2, obj3
{display_value="Change", value="Change"}, {display_value="Cancel all future Tasks", value="cancel"}, { display_value="2021-01-21 15:00:01", value="2021-01-21 15:00:01"}
{display_value="Change", value="Change"}, {display_value="Cancel all future Tasks", value="cancel"}, {display_value="2021-01-07 20:36:34", value="2021-01-07 20:36:34"}
{display_value=null, value=""}, {display_value="Cancel all future Tasks", value="cancel"}, {display_value="2021-02-11 19:40:40", value="2021-02-11 19:40:40"}
```
May I have some guidance on constructing such an Athena query to give this kind of output for the sample data, either with the custom Glue classifier or without it?
##### Sources
[1] https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html
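For illustration only, a minimal Athena sketch along the lines of [1], assuming the no-classifier case where the crawler produced a table (the database and table names below are placeholders) with a single column named `array` whose type is an array of the nested structs; each array element becomes its own row:
```sql
-- Hedged sketch: database and table names are placeholders.
-- "array" is quoted because it is a reserved word in Athena.
SELECT
    item.obj1,
    item.obj2,
    item.obj3
FROM my_database.servicenow_data1
CROSS JOIN UNNEST("array") AS t(item);
```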
Hi,
We are using Glue to read data from an on-premises Oracle DB table. The table is large and has 40+ columns, but we are interested in data from only 3 of them. Does AWS Glue support reading from Oracle views? Has anyone tried and tested this? Please assist.
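For illustration, reading from a view generally works the same way as reading from a table over JDBC, and pushing a query down lets you pull only the three columns you need. A hedged sketch using Spark's JDBC reader inside a Glue job, with placeholder connection details, view, and column names:
```
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Hedged sketch: host, service, credentials, view and column names are placeholders.
oracle_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//my-oracle-host:1521/MYSERVICE")
    .option("query", "SELECT col1, col2, col3 FROM my_schema.my_view")
    .option("user", "my_user")
    .option("password", "my_password")
    .option("driver", "oracle.jdbc.OracleDriver")
    .load()
)
oracle_df.show(5)
```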
I've tried to use a Glue Spark job for very basic partitioning over about 50 GB of gzipped JSON data.
The reason for trying a Glue job is that my data could have more than 100 partitions, and it is not really convenient to do that in Athena.
The code is almost straight from the AWS template.
```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import concat, col, lit, substring
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
my_partition_predicate = "partition='main' and year='2022' and month='12' and day='17'"
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_db", table_name = "my_table", push_down_predicate = my_partition_predicate, transformation_ctx = "datasource0", additional_options = {"recurse": True})
datasource0 = datasource0.toDF()
datasource0.printSchema()
datasource0 = datasource0.withColumn("od", concat(substring('originalrequest',9,3), lit("_"), substring('originalrequest',12,3)))
datasource0 = DynamicFrame.fromDF(datasource0, glueContext, "datasource0")
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame = datasource0,
    connection_type = "s3",
    connection_options = {"path": "s3://my_b/glue4", "partitionKeys": ["od"], "compression": "gzip"},
    format = "json"
)
job.commit()
```
The job was executed with 60 DPUs, and after 20 minutes it timed out... This failed execution cost $8.80.
Meanwhile, exactly the same job was done in Athena in about 2 minutes and cost $0.25.
Am I doing something wrong, or is Athena (Presto) leaps ahead of Spark in terms of speed and cost-effectiveness?
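For comparison (not a confirmed fix for the timeout), one variation sometimes worth trying is to skip the conversion back to a DynamicFrame and write with Spark's native JSON writer, repartitioning by the derived key first so each partition value is written by only a few tasks. A minimal sketch over the same source table and output path as the job above:
```
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import concat, lit, substring

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

my_partition_predicate = "partition='main' and year='2022' and month='12' and day='17'"

# Read the same partition as in the question and derive the "od" key.
df = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table",
    push_down_predicate=my_partition_predicate,
    additional_options={"recurse": True},
).toDF()
df = df.withColumn("od", concat(substring('originalrequest', 9, 3), lit("_"), substring('originalrequest', 12, 3)))

# Repartition by the partition key so each "od" value is written by few tasks,
# then write directly with Spark's JSON writer using gzip compression.
(df.repartition("od")
   .write
   .mode("append")
   .partitionBy("od")
   .option("compression", "gzip")
   .json("s3://my_b/glue4"))

job.commit()
```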