Questions tagged with AWS Glue
HIVE_UNKNOWN_ERROR: Path is not absolute: s3://datapipeline-youtube-json-cleaned-data-23
This query ran against the "" database, unless qualified by the query.
The data is present in the S3 path, but it is not getting queried.
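Not part of the original post, but a quick way to inspect the registered table location is shown below; a commonly reported cause of this particular error is a table `LOCATION` set to the bare bucket with no prefix or trailing slash. The database and table names here are placeholders, and the sketch only assumes the standard boto3 Glue API.

```python
import boto3

glue = boto3.client("glue")

# Placeholder database/table names for the affected Athena table.
table = glue.get_table(DatabaseName="my_db", Name="my_table")
location = table["Table"]["StorageDescriptor"]["Location"]

# Athena generally expects a location like "s3://bucket/prefix/",
# not a bare "s3://bucket".
print(location)
```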
I have a Glue job of type "Ray" that was deployed using CDK. I'm using the following parameters for the job run:
--enable-glue-datacatalog true
--library-set analytics
--TempDir s3://{bucket}/temporary/
--additional-python-modules s3://{bucket}/{module}.zip
The job has a role which has access to the buckets for both TempDir and additional-python-modules. When looking at the logs in CloudWatch, I can see that the job does everything it's supposed to do, but in the console the job fails with the error "Error while writing result to S3 working directory". I can't find any details in any of the log groups.
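For context, a minimal CDK (Python) sketch of how such a Ray job might be declared is below. It is an assumption, not the poster's stack: the construct name, role ARN, bucket, module name, and script path are all placeholders; only the job arguments mirror the ones listed above.

```python
from aws_cdk import Stack, aws_glue as glue
from constructs import Construct

class RayJobStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        bucket = "my-bucket"   # placeholder
        module = "my_module"   # placeholder
        glue.CfnJob(
            self, "RayJob",
            role="arn:aws:iam::123456789012:role/MyGlueJobRole",  # placeholder role
            glue_version="4.0",
            command=glue.CfnJob.JobCommandProperty(
                name="glueray",
                python_version="3.9",
                runtime="Ray2.4",
                script_location=f"s3://{bucket}/scripts/job.py",
            ),
            worker_type="Z.2X",
            number_of_workers=2,
            default_arguments={
                "--enable-glue-datacatalog": "true",
                "--library-set": "analytics",
                "--TempDir": f"s3://{bucket}/temporary/",
                "--additional-python-modules": f"s3://{bucket}/{module}.zip",
            },
        )
```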
I'd like to run evaluations of my data quality rulesets on single partitions of my table rather than the whole table. This is because for most of my tables each partition effectively represents a snapshot of the data and running the checks only makes sense in the context of a single partition. Is there a way to filter or subset the data that a ruleset is evaluated on?
Preferably I'd like to do this when triggering the evaluation, but defining a restriction in the DQDL rules might also work.
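One way to get this effect today, sketched below: read only the partition of interest with a push_down_predicate and run the data quality transform on that frame, so the ruleset is evaluated on a single partition rather than the whole table. This is a minimal sketch with placeholder database, table, partition, and rule names, using the `EvaluateDataQuality.apply` form that Glue Studio emits; it is not a way to restrict the evaluation from within DQDL itself.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality

glue_context = GlueContext(SparkContext.getOrCreate())

# Read only the snapshot partition to be checked (placeholder names).
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table",
    push_down_predicate="snapshot_date = '2023-06-01'",
    transformation_ctx="frame",
)

# Evaluate the ruleset against just that partition's rows.
ruleset = """
Rules = [
    IsComplete "id",
    RowCount > 0
]
"""
results = EvaluateDataQuality.apply(
    frame=frame,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "single_partition_check",
        "enableDataQualityResultsPublishing": True,
    },
)
```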
We're experimenting with Glue 4.0 features and facing a few issues. When we contacted AWS Support for troubleshooting a few days ago, we were informed that it is still in preview and were asked to switch to Glue 3.0 if possible. Is this true?
I have a Glue job where I'm creating a DynamicFrame from the Glue catalog. I am getting an intermittent error with o343.getDynamicFrame, but if I rerun the job, it succeeds.
Exception: job failed due to error - An error occurred while calling o343.getDynamicFrame.
command it fails on:
source_dynamic_df = glueContext.create_dynamic_frame.from_catalog(database = src_catalog_db, table_name = src_tbl_nm, push_down_predicate = partition_predicate, additional_options={"mergeSchema": "true"}, transformation_ctx = "source_dynamic_df")
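Since the failure is intermittent and a rerun succeeds, one pragmatic option is to wrap the catalog read in a small retry. The sketch below is an assumption, not a fix for the underlying error; it simply retries the same from_catalog call, and the function/parameter names are placeholders.

```python
import time

def create_frame_with_retry(glue_context, database, table_name, predicate,
                            retries=3, delay_seconds=60):
    """Retry the catalog read a few times before giving up."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return glue_context.create_dynamic_frame.from_catalog(
                database=database,
                table_name=table_name,
                push_down_predicate=predicate,
                additional_options={"mergeSchema": "true"},
                transformation_ctx="source_dynamic_df",
            )
        except Exception as error:  # the o343.getDynamicFrame failure surfaces as a Py4J exception
            last_error = error
            time.sleep(delay_seconds * attempt)  # simple linear back-off
    raise last_error

# Usage with the poster's variables:
# source_dynamic_df = create_frame_with_retry(
#     glueContext, src_catalog_db, src_tbl_nm, partition_predicate)
```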
Hi all,
I have followed the instructions at https://docs.aws.amazon.com/athena/latest/ug/connect-data-source-serverless-app-repo.html to deploy Timestream as an additional data source for Athena, and I can successfully query Timestream data via the Athena console using the catalog "TimestreamCatalog" that I added.
Now I need to use the same catalog "TimestreamCatalog" when building a Glue job.
I run:
```
DataCatalogtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    catalog_id="TimestreamCatalog",
    database="mydb",
    table_name="mytable",
    transformation_ctx="DataCatalogtable_node1",
)
```
and run into this error, even though the role in question has an administrator policy (i.e. `Action: *`, `Resource: *`) attached for the sake of the experiment:
```
An error occurred while calling o86.getCatalogSource. User: arn:aws:sts::*******:assumed-role/AWSGlueServiceRole-andrei/GlueJobRunnerSession is not authorized to perform: glue:GetTable on resource: arn:aws:glue:eu-central-1:TimestreamCatalog:catalog (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 36d7e411-8ca9-4993-9066-b6ca1d7ea4a3; Proxy: null)
```
When calling `aws athena list-data-catalogs`, I get:
```
{
    "DataCatalogsSummary": [
        {
            "CatalogName": "AwsDataCatalog",
            "Type": "GLUE"
        },
        {
            "CatalogName": "TimestreamCatalog",
            "Type": "LAMBDA"
        }
    ]
}
```
I am not sure if using the data source name as catalog_id is correct here, so any hint on what catalog_id is supposed to be for a custom data source is appreciated, or any hint on how to resolve the issue above.
Thanks, Andrei
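Not an answer from the thread, but one workaround sketch: query the federated catalog through Athena itself from within the Glue job, e.g. with the AWS SDK for pandas (awswrangler), which accepts a data_source argument. This assumes awswrangler is supplied via --additional-python-modules, the job role has Athena and Lambda-connector permissions, and the database/table names below are placeholders.

```python
import awswrangler as wr

# Query the Timestream federated catalog through Athena rather than through
# the Glue Data Catalog API (catalog/database/table names are placeholders).
df = wr.athena.read_sql_query(
    sql='SELECT * FROM "mydb"."mytable" LIMIT 100',
    database="mydb",
    data_source="TimestreamCatalog",
)
```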
I've created a data validation box in my Glue ETL, which imports the following:
`from awsgluedq.transforms import EvaluateDataQuality`
To develop my script further, I've copied it to an AWS Glue notebook. But the import doesn't work; it throws the error:
`ModuleNotFoundError: No module named 'awsgluedq'`
I've tried to add it through the magics `%extra_py_files ['awsgluedq']` and `%additional_python_modules ['awsgluedq']`, but neither works.
How can I import that module?
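One thing worth checking (an assumption, not confirmed in the thread): the transform ships with the Glue runtime rather than from PyPI, so it is not something %additional_python_modules can install. Pinning the interactive session to a recent Glue version before the first code cell runs is a minimal sketch of that idea; the version and sizing values below are only illustrative.

```python
# Session-configuration magics must run before the session starts
# (i.e. before any non-magic cell). Values here are assumptions.
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2

# Once the session runs on a runtime that bundles the transform,
# the import is the same as in the generated ETL script.
from awsgluedq.transforms import EvaluateDataQuality
```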
We're using the `GlueSchemaRegistryDeserializerDataParser` class from https://github.com/awslabs/aws-glue-schema-registry.
This seems to be from v1 of the AWS SDK (or am I wrong?)
Is there a replacement in aws-sdk-java-v2 (https://github.com/aws/aws-sdk-java-v2)?
Here is my code snippet:
transaction_DyF = glueContext.create_dynamic_frame.from_catalog(
    database=source_db,
    table_name=source_tbl,
    push_down_predicate=pushDownPredicate,
    transformation_ctx=source_tbl)
Error Message:
Error: An error occurred while calling o103.getDynamicFrame.\n: java.lang.ClassNotFoundException: Failed to find data source: UNKNOWN
I have created a Glue job that reads a Parquet file from S3 and uses the Iceberg connector to create an Iceberg table. I used the catalog name my_catalog, created the database with the name db, and gave the table the name sampletable, but when I run the job it fails with the error below:
**AnalysisException: The namespace in session catalog must have exactly one name part: my_catalog.db.sampletable**
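This error typically indicates that Spark resolved `my_catalog.db.sampletable` against the built-in session catalog because no catalog named my_catalog was registered. Below is a minimal sketch of registering an Iceberg catalog with that name, backed by the Glue Data Catalog; the warehouse path and input path are placeholders, and on Glue these settings are often passed as `--conf` job parameters instead of being set in code.

```python
from pyspark.sql import SparkSession

# Register an Iceberg catalog called "my_catalog" backed by the Glue Data Catalog.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.my_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.my_catalog.warehouse",
            "s3://my-bucket/iceberg-warehouse/")  # placeholder warehouse path
    .getOrCreate()
)

# With the catalog registered, the three-part identifier resolves as intended.
df = spark.read.parquet("s3://my-bucket/input/")  # placeholder input path
df.writeTo("my_catalog.db.sampletable").createOrReplace()
```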
I'm writing into Redshift and realized Glue 4.0 is probably optimizing the column sizes.
Summary of error:
```
py4j.protocol.Py4JJavaError: An error occurred while calling o236.pyWriteDynamicFrame.
: java.sql.SQLException:
Error (code 1204) while loading data into Redshift: "String length exceeds DDL length"
Table name: "PUBLIC"."table_name"
Column name: column_a
Column type: varchar(256)
```
In previous Glue versions, the string columns were always varchar(65535), but now my tables are created with varchar(256), and writing into some columns fails due to this error. Will this occur with other data types too? How can I solve this within Glue 4.0?
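One possible approach, sketched below as an assumption rather than a confirmed fix: the Spark Redshift integration honors a `maxlength` column metadata hint when it creates varchar columns, so widening the offending string column before the write may avoid the 1204 error. Here `dyf` and `glueContext` stand in for the frame and context the job already has, and the column name mirrors the error message.

```python
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col

# Convert the existing DynamicFrame, widen the declared maximum length of the
# overflowing column, then convert back before the Redshift sink.
df = dyf.toDF()
df = df.withColumn(
    "column_a",
    col("column_a").alias("column_a", metadata={"maxlength": 65535}),
)
dyf_widened = DynamicFrame.fromDF(df, glueContext, "dyf_widened")
```

The widened frame can then be passed to the existing write_dynamic_frame call in place of the original one.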