Questions tagged with AWS Glue


HIVE_UNKNOWN_ERROR: Path is not absolute: s3://datapipeline-youtube-json-cleaned-data-23 This query ran against the "" database, unless qualified by the query. The data is present in the S3 path, but it is not being queried.
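One commonly reported source of this error is a catalog table whose registered location is the bare bucket name rather than a path ending in a slash. A minimal sketch for checking the registered location with boto3; the database and table names below are hypothetical placeholders:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names -- substitute the database/table the Athena query targets.
table = glue.get_table(DatabaseName="youtube_db", Name="cleaned_data")["Table"]

# If this prints something like "s3://datapipeline-youtube-json-cleaned-data-23"
# with no key prefix or trailing slash, Athena can raise "Path is not absolute".
print(table["StorageDescriptor"]["Location"])
```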
2 answers · 0 votes · 65 views · asked 2 months ago
I have a Glue job of type "Ray" that was deployed using CDK. I'm using the following parameters for the job run: `--enable-glue-datacatalog true`, `library-set analytics`, `--TempDir s3://{bucket}/temporary/`, `--additional-python-modules s3://{bucket}/{module}.zip`. The job has a role with access to the buckets for both TempDir and additional-python-modules. Looking at the logs in CloudWatch, I can see that the job does everything it's supposed to do, but in the console the job fails with the error "Error while writing result to S3 working directory". I can't find any details in any of the log groups.
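For context, a minimal sketch of how job-run arguments like these could be passed when starting the job with boto3; the job name and bucket are placeholders, and `library-set` is copied as written above:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical job name and bucket -- the arguments mirror those quoted above.
run = glue.start_job_run(
    JobName="my-ray-job",
    Arguments={
        "--enable-glue-datacatalog": "true",
        "library-set": "analytics",  # Ray-specific parameter, copied as written in the question
        "--TempDir": "s3://my-bucket/temporary/",
        "--additional-python-modules": "s3://my-bucket/my_module.zip",
    },
)
print(run["JobRunId"])
```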
2 answers · 0 votes · 25 views · asked 2 months ago
I'd like to run evaluations of my data quality rulesets on single partitions of my table rather than on the whole table. For most of my tables, each partition effectively represents a snapshot of the data, so running the checks only makes sense in the context of a single partition. Is there a way to filter or subset the data that a ruleset is evaluated on? Preferably I'd like to do this when triggering the evaluation, but defining a restriction in the DQDL rules might also work.
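Within a Glue ETL job, one way to approximate this is to read only the partition of interest with a pushdown predicate and evaluate the ruleset against that frame. A rough sketch, assuming the `EvaluateDataQuality` transform and hypothetical database, table, partition column, and rule values:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality

glueContext = GlueContext(SparkContext.getOrCreate())

# Hypothetical names -- read a single partition, then evaluate only that data.
frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table",
    push_down_predicate="snapshot_date = '2023-06-01'",
)

# Hypothetical DQDL ruleset.
ruleset = """Rules = [ RowCount > 0, IsComplete "id" ]"""

results = EvaluateDataQuality.apply(
    frame=frame,
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "partition_check"},
)
results.toDF().show()
```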
1 answer · 0 votes · 38 views · asked 2 months ago
We're experimenting with Glue 4.0 features and facing a few issues. When we contacted AWS Support for troubleshooting a few days ago, we were informed that it is still in preview and were asked to switch to Glue 3.0 if possible. Is this true?
1 answer · 0 votes · 59 views · asked 2 months ago
I have a Glue job where I'm creating a DynamicFrame from the Glue catalog. I am getting an intermittent error from o343.getDynamicFrame, but if I rerun the job it succeeds. Exception: job failed due to error - An error occurred while calling o343.getDynamicFrame. The command it fails on: `source_dynamic_df = glueContext.create_dynamic_frame.from_catalog(database = src_catalog_db, table_name = src_tbl_nm, push_down_predicate = partition_predicate, additional_options={"mergeSchema": "true"}, transformation_ctx = "source_dynamic_df")`
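Since the failure is intermittent and a rerun succeeds, one pragmatic workaround is to retry the catalog read a few times before letting the job fail. A minimal sketch, assuming the variables used in the snippet above are already defined:

```python
import time

# Hypothetical retry wrapper around the catalog read shown above.
def read_with_retry(glue_context, attempts=3, delay_seconds=30):
    for attempt in range(1, attempts + 1):
        try:
            return glue_context.create_dynamic_frame.from_catalog(
                database=src_catalog_db,
                table_name=src_tbl_nm,
                push_down_predicate=partition_predicate,
                additional_options={"mergeSchema": "true"},
                transformation_ctx="source_dynamic_df",
            )
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)

source_dynamic_df = read_with_retry(glueContext)
```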
1 answer · 0 votes · 25 views · Amanda · asked 2 months ago
Hi all, I have followed the instructions at https://docs.aws.amazon.com/athena/latest/ug/connect-data-source-serverless-app-repo.html to deploy Timestream as an additional data source for Athena, and I can successfully query Timestream data via the Athena console using the catalog "TimestreamCatalog" I added. Now I need to use the same catalog "TimestreamCatalog" when building a Glue job. I run:

```
DataCatalogtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    catalog_id="TimestreamCatalog",
    database="mydb",
    table_name="mytable",
    transformation_ctx="DataCatalogtable_node1",
)
```

and run into this error, even when the role in question has an Administrator policy, i.e. action:* resource:*, attached (for the sake of the experiment):

```
An error occurred while calling o86.getCatalogSource. User: arn:aws:sts::*******:assumed-role/AWSGlueServiceRole-andrei/GlueJobRunnerSession is not authorized to perform: glue:GetTable on resource: arn:aws:glue:eu-central-1:TimestreamCatalog:catalog (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 36d7e411-8ca9-4993-9066-b6ca1d7ea4a3; Proxy: null)
```

When calling `aws athena list-data-catalogs`, I get:

```
{
    "DataCatalogsSummary": [
        {
            "CatalogName": "AwsDataCatalog",
            "Type": "GLUE"
        },
        {
            "CatalogName": "TimestreamCatalog",
            "Type": "LAMBDA"
        }
    ]
}
```

I am not sure if using the data source name as catalog_id is correct here, so any hint on what catalog_id is supposed to be for a custom data source is appreciated, or any hint on how to resolve the issue above. Thanks, Andrei
1 answer · 0 votes · 46 views · asked 2 months ago
I've created a data validation box in my Glue ETL job, which imports the following: `from awsgluedq.transforms import EvaluateDataQuality`. To develop my script further, I copied it into an AWS Glue notebook, but that line doesn't work there; it throws the error `ModuleNotFoundError: No module named 'awsgluedq'`. I've tried to add the module through the magics `%extra_py_files ['awsgluedq']` and `%additional_python_modules ['awsgluedq']`, but neither works. How can I import that module?
1 answer · 0 votes · 58 views · Someone · asked 3 months ago
We're using the `GlueSchemaRegistryDeserializerDataParser` class from https://github.com/awslabs/aws-glue-schema-registry. This seems to be based on v1 of the AWS SDK for Java (or am I wrong?). Is there a replacement in aws-sdk-java-v2 (https://github.com/aws/aws-sdk-java-v2)?
0 answers · 0 votes · 31 views · Jules · asked 3 months ago
Here is my code snippet: `transaction_DyF = glueContext.create_dynamic_frame.from_catalog(database = source_db, table_name = source_tbl, push_down_predicate = pushDownPredicate, transformation_ctx = source_tbl)` Error message: `An error occurred while calling o103.getDynamicFrame.: java.lang.ClassNotFoundException: Failed to find data source: UNKNOWN`
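This exception is often reported when the catalog table that `from_catalog` points at has no classification or input/output format recorded (for example, a table defined by hand rather than by a crawler), so Spark looks for a data source literally named "UNKNOWN". A sketch for inspecting those fields with boto3; the database and table names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names -- check what format information the catalog holds for the table.
table = glue.get_table(DatabaseName="source_db", Name="source_tbl")["Table"]
sd = table["StorageDescriptor"]

print(table.get("Parameters", {}).get("classification"))      # e.g. "parquet", "json", or None
print(sd.get("InputFormat"))                                   # e.g. a Parquet/ORC/Text input format class
print(sd.get("SerdeInfo", {}).get("SerializationLibrary"))     # the SerDe the table is registered with
```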
1 answer · 0 votes · 77 views · mrjimi · asked 3 months ago
I have created a Glue job that reads a Parquet file from S3 and uses the Iceberg connector to create an Iceberg table. I used my_catalog as the catalog name, created the database with the name db, and gave the table the name sampletable, but when I run the job it fails with the error below: **AnalysisException: The namespace in session catalog must have exactly one name part: my_catalog.db.sampletable**
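For comparison, this is roughly the Spark catalog configuration an Iceberg-on-Glue job typically needs so that `my_catalog` resolves as an Iceberg catalog rather than falling back to Spark's session catalog; the warehouse and input paths below are placeholders, and in a Glue job these settings are usually supplied through the `--conf` job parameter:

```python
from pyspark.sql import SparkSession

# Hypothetical warehouse location; catalog/database/table names mirror the question.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://my-bucket/warehouse/")
    .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

df = spark.read.parquet("s3://my-bucket/input/")

# With the catalog registered above, the three-part identifier resolves against
# the Iceberg catalog instead of the session catalog.
df.writeTo("my_catalog.db.sampletable").createOrReplace()
```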
1 answer · 0 votes · 70 views · asked 3 months ago
I'm writing into Redshift and realized Glue 4.0 is probably optimizing the column sizes. Summary of the error:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o236.pyWriteDynamicFrame.
: java.sql.SQLException: Error (code 1204) while loading data into Redshift: "String length exceeds DDL length"
Table name: "PUBLIC"."table_name"
Column name: column_a
Column type: varchar(256)
```

In previous Glue versions, the string columns were always created as varchar(65535), but now my tables are created with varchar(256), and writing into some columns fails with this error. Will this also occur with other data types? How can I solve this within Glue 4.0?
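One workaround sometimes used in this situation is to create the target table explicitly with wider columns before the load, for example via a `preactions` statement on the Redshift write. A rough sketch, assuming a DynamicFrame named `frame`, an existing `glueContext`, and hypothetical connection, table, and bucket names:

```python
# Assumes `frame` is the DynamicFrame being written and `glueContext` already exists.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=frame,
    catalog_connection="redshift-conn",  # hypothetical Glue connection name
    connection_options={
        "dbtable": "public.table_name",
        "database": "dev",
        # Pre-create the table with explicit column widths instead of relying on
        # the automatically inferred varchar(256).
        "preactions": """
            DROP TABLE IF EXISTS public.table_name;
            CREATE TABLE public.table_name (column_a VARCHAR(65535));
        """,
    },
    redshift_tmp_dir="s3://my-bucket/redshift-temp/",  # hypothetical temp dir
)
```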
1 answer · 0 votes · 65 views · asked 3 months ago