Questions tagged with AWS Glue


Hi, could anyone help with this? We are hitting it while migrating to the Glue Data Catalog: Athena cannot read the timestamp partition value, which makes the table unselectable. HIVE_INVALID_PARTITION_VALUE: Invalid partition value '2022-08-09 23%3A59%3A59' for TIMESTAMP partition key: xxx_timestamp=2022-08-09 23%253A59%253A59
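A minimal sketch of one possible repair, assuming the root cause is the URL-encoded colons (`%3A`) left in the partition values by the migration: re-register the affected partitions with decoded values via boto3. The database and table names below are placeholders, and this is only an idea to test on a copy first, not a confirmed fix.

```
import urllib.parse
import boto3

glue = boto3.client("glue")

# Placeholders: replace with the real database/table names.
DATABASE, TABLE = "my_database", "my_table"

paginator = glue.get_paginator("get_partitions")
for page in paginator.paginate(DatabaseName=DATABASE, TableName=TABLE):
    for partition in page["Partitions"]:
        old_values = partition["Values"]
        # Decode URL-encoded characters such as '%3A' back to ':'.
        new_values = [urllib.parse.unquote(v) for v in old_values]
        if new_values != old_values:
            glue.update_partition(
                DatabaseName=DATABASE,
                TableName=TABLE,
                PartitionValueList=old_values,
                PartitionInput={
                    "Values": new_values,
                    "StorageDescriptor": partition["StorageDescriptor"],
                },
            )
```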
0
answers
0
votes
2
views
asked an hour ago
I converted a CSV (from S3) to Parquet (written back to S3) using AWS Glue, and the resulting Parquet file was given a random name. How do I choose the name of the Parquet file the CSV is converted to? ![Enter image description here](/media/postImages/original/IMUQds6rTFS8i2Yv9YENybcQ) When I add data.parquet at the end of the S3 target path without a trailing '/', AWS Glue creates a subfolder in the bucket named data.parquet instead of using it as the file name, while the new Parquet file is still created with a name like "run-1678983665978-part-block-0-r-00000-snappy.parquet". Where should I specify a name for the Parquet file?
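Glue (like Spark generally) names its part files itself, so one common workaround is to rename the output object after the job finishes. A minimal boto3 sketch, assuming a single output file; the bucket, prefix, and target key below are placeholders:

```
import boto3

s3 = boto3.client("s3")

# Placeholders: replace with your bucket, the job's output prefix, and the name you want.
BUCKET, PREFIX, TARGET_KEY = "my-bucket", "output/", "output/data.parquet"

# Find the part file the job wrote (e.g. run-...-snappy.parquet) ...
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)["Contents"]
part_key = next(o["Key"] for o in objects if o["Key"].endswith(".parquet"))

# ... then copy it to the desired key and delete the original.
s3.copy_object(Bucket=BUCKET, CopySource={"Bucket": BUCKET, "Key": part_key}, Key=TARGET_KEY)
s3.delete_object(Bucket=BUCKET, Key=part_key)
```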
1
answers
0
votes
16
views
asked a day ago
I followed this blog to try the Hudi connector: [Ingest streaming data to Apache Hudi tables using AWS Glue and Apache Hudi DeltaStreamer](https://aws.amazon.com/cn/blogs/big-data/ingest-streaming-data-to-apache-hudi-tables-using-aws-glue-and-apache-hudi-deltastreamer/). But whenever I start the Glue job, I get this error log:

```
2023-03-28 12:39:33,136 - __main__ - INFO - Glue ETL Marketplace - Preparing layer url and gz file path to store layer 8de5b65bd171294b1e04e0df439f4ea11ce923b642eddf3b3d76d297bfd2670c.
2023-03-28 12:39:33,136 - __main__ - INFO - Glue ETL Marketplace - Getting the layer file 8de5b65bd171294b1e04e0df439f4ea11ce923b642eddf3b3d76d297bfd2670c and store it as gz.
Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/aws_glue_custom_connector_python/docker/unpack_docker_image.py", line 361, in <module>
    main()
  File "/tmp/aws_glue_custom_connector_python/docker/unpack_docker_image.py", line 351, in main
    res += download_jars_per_connection(conn, region, endpoint, proxy)
  File "/tmp/aws_glue_custom_connector_python/docker/unpack_docker_image.py", line 304, in download_jars_per_connection
    download_and_unpack_docker_layer(ecr_url, layer["digest"], dir_prefix, http_header)
  File "/tmp/aws_glue_custom_connector_python/docker/unpack_docker_image.py", line 168, in download_and_unpack_docker_layer
    layer = send_get_request(layer_url, header)
  File "/tmp/aws_glue_custom_connector_python/docker/unpack_docker_image.py", line 80, in send_get_request
    response.raise_for_status()
  File "/home/spark/.local/lib/python3.7/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://prod-us-east-1-starport-layer-bucket.s3.us-east-1.amazonaws.com/6a636e-709825985650-a6bdf6d5-eba8-e643-536c-26147c8be5f0/84e9f346-bf80-4532-ac33-b00f5dbfa546?X-Amz-Security-Token=....Ks4HlEAQcC0PUIFipDGrNhcEAVTZQ%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20230328T123933Z&X-Amz-SignedHeaders=host&X-Amz-Expires=3600&X-Amz-Credential=%2F20230328%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=c28f35ab3b3c
Glue ETL Marketplace - failed to download connector, activation script exited with code 1
LAUNCH ERROR | Glue ETL Marketplace - failed to download connector.Please refer logs for details.
Exception in thread "main" java.lang.Exception: Glue ETL Marketplace - failed to download connector.
  at com.amazonaws.services.glue.PrepareLaunch.downloadConnectorJar(PrepareLaunch.scala:1043)
  at com.amazonaws.services.glue.PrepareLaunch.com$amazonaws$services$glue$PrepareLaunch$$prepareCmd(PrepareLaunch.scala:759)
  at com.amazonaws.services.glue.PrepareLaunch$.main(PrepareLaunch.scala:42)
  at com.amazonaws.services.glue.PrepareLaunch.main(PrepareLaunch.scala)
```

I guess the root cause is one of these:

1. The Glue job cannot pull the connector image from AWS Marketplace.
2. The connector image cannot be stored in the S3 bucket.

So I tried these remedies:

1. Grant permissions to the job's IAM role. I attached `AWSMarketplaceFullAccess`, `AmazonEC2ContainerRegistryFullAccess`, and `AmazonS3FullAccess`, which I would expect to be more than sufficient.
2. Make the S3 bucket public. I turned off `Block public access` on the related S3 bucket.

But even after doing all this, I still get the same error. Can someone offer any suggestions?
Accepted Answer | Amazon EC2 | AWS Glue
1
answers
0
votes
10
views
donglai
asked a day ago
Hi all, thanks to this article https://repost.aws/de/knowledge-center/query-glue-data-catalog-cross-account I know that with AWS EMR I can access the Glue Catalog of my current account and, by setting up the right permissions, also the Glue Catalog of another account at the same time. My question is whether this is also possible with a Glue ETL job. I know that a cross-account Glue Catalog can be set up with the --conf spark.hadoop.hive.metastore.glue.catalogid parameter, but if I want to access tables from two other accounts, I have a problem. Anyone have an idea? Thanks for the help. Best
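One detail that may help here, as a sketch rather than a confirmed answer: in a Glue ETL job, `glueContext.create_dynamic_frame.from_catalog` accepts a `catalog_id` argument, which could let a single job read tables from more than one account's catalog without relying on the single `spark.hadoop.hive.metastore.glue.catalogid` setting. The account IDs, database, and table names below are placeholders, and the cross-account resource policies from the linked article still need to be in place.

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table from account A's catalog (placeholder IDs and names).
frame_a = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
    catalog_id="111111111111",
)

# Read a table from account B's catalog in the same job.
frame_b = glue_context.create_dynamic_frame.from_catalog(
    database="marketing_db",
    table_name="campaigns",
    catalog_id="222222222222",
)
```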
1
answers
0
votes
16
views
asked 2 days ago
I'm writing partitioned Parquet data using a Spark DataFrame with mode=overwrite to update stale partitions. I have this set: spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic'). The data is written correctly, with all the partitioning set correctly, but I am also getting empty files created at each level of the path, named <path_level>_$folder$. Removing mode=overwrite eliminates this strange behavior. Is there any way to prevent these zero-size files from being created? Have I misconfigured something?
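If the markers can't be suppressed at write time, one workaround is simply to delete the zero-byte `_$folder$` objects after each write. A small cleanup sketch with boto3; the bucket and prefix are placeholders:

```
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "my-bucket", "warehouse/my_table/"  # placeholders

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # The markers are zero-byte objects whose keys end with "_$folder$".
        if obj["Key"].endswith("_$folder$"):
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```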
1
answers
0
votes
14
views
asked 2 days ago
How can one set Execution Class = FLEX on a Jupyter job run? I'm using the magics in my %%configure cell as below, and also setting the input argument --execution_class=FLEX, but the jobs still kick off as STANDARD.

```
%%configure
{
  "region": "us-east-1",
  "idle_timeout": "480",
  "glue_version": "3.0",
  "number_of_workers": 10,
  "execution_class": "FLEX",
  "worker_type": "G.1X"
}
```

![Enter image description here](/media/postImages/original/IMgaPRfCicTAKewOu41SXTqw)
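For comparison, the Flex execution class can be requested on a plain job run through the StartJobRun API; whether interactive sessions honor it at all is a separate question. A minimal boto3 sketch, with a placeholder job name:

```
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Request the FLEX execution class explicitly on a job run (job name is a placeholder).
response = glue.start_job_run(
    JobName="my-glue-job",
    ExecutionClass="FLEX",
)
print(response["JobRunId"])
```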
2
answers
0
votes
43
views
asked 4 days ago
I attempted to create a partition index on a table. The index failed to create with a backfill error, which I can see by calling client.get_partition_indexes(): 'IndexStatus': 'FAILED', 'BackfillErrors': [{'Code': 'ENCRYPTED_PARTITION_ERROR'..... I cannot delete this failed index, either through the console or via the API client.delete_partition_index(). The delete attempt returns: EntityNotFoundException: An error occurred (EntityNotFoundException) when calling the DeletePartitionIndex operation: Index with the given indexName : <index_name> does not exist. The failed index remains visible in the console. Two questions: 1. How do I get rid of this failed index? 2. What encryption is causing the error? Is it the bucket where the data is stored being encrypted? Or the catalog metadata encryption? Or something else?
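In case the console shows a slightly different name than the API expects, it may be worth retrying the delete with exactly the `IndexName` string returned by `get_partition_indexes`. A small sketch; the database and table names are placeholders, and this only restates the APIs the question already mentions:

```
import boto3

glue = boto3.client("glue")
DATABASE, TABLE = "my_database", "my_table"  # placeholders

# List the indexes the API actually knows about, including failed ones.
resp = glue.get_partition_indexes(DatabaseName=DATABASE, TableName=TABLE)
for idx in resp["PartitionIndexDescriptorList"]:
    print(idx["IndexName"], idx["IndexStatus"], idx.get("BackfillErrors"))

# Then delete using the exact name reported above.
glue.delete_partition_index(
    DatabaseName=DATABASE,
    TableName=TABLE,
    IndexName=resp["PartitionIndexDescriptorList"][0]["IndexName"],
)
```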
1
answers
0
votes
27
views
asked 4 days ago
Hello, I have two questions/issues:

1. I create the table like this:

```
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", f"s3://co-raw-sales-dev")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .enableHiveSupport()
    .getOrCreate()
)

df.writeTo("glue_catalog.co_raw_sales_dev.new_test").using("iceberg").create()
```

The created table's DDL is:

```
CREATE TABLE co_raw_sales_dev.new_test (
  id bigint,
  name string,
  points bigint)
LOCATION 's3://co-raw-sales-dev//new_test'
TBLPROPERTIES (
  'table_type'='iceberg'
);
```

The problem is the double // in the S3 location between the bucket and the table name.

2. This works:

```
df.writeTo("glue_catalog.co_raw_sales_dev.new_test2").using("iceberg").create()
```

but if I remove "glue_catalog", like:

```
df.writeTo("co_raw_sales_dev.new_test2").using("iceberg").create()
```

I get the error: An error occurred while calling o339.create. Table implementation does not support writes: co_raw_sales_dev.new_test2

Am I missing some parameter in the SparkSession config? Thank you, Adas.
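For the second issue, one likely missing piece is Spark's default catalog setting: without it, an unqualified name like `co_raw_sales_dev.new_test2` resolves against the built-in `spark_catalog` rather than the Iceberg one. A sketch of the extra config, reusing the session settings from the question; the guess that a warehouse subpath avoids the double slash is only an assumption:

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    # Guess: pointing the warehouse at a subpath instead of the bucket root may avoid the '//'.
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://co-raw-sales-dev/warehouse")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    # Make unqualified table names resolve to the Iceberg catalog.
    .config("spark.sql.defaultCatalog", "glue_catalog")
    .getOrCreate()
)

# With the default catalog set, this should route to Iceberg without the explicit prefix:
# df.writeTo("co_raw_sales_dev.new_test2").using("iceberg").create()
```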
1
answers
0
votes
22
views
asked 5 days ago
I set up AdministratorAccess for my role; this is a top-level policy that should let the role use every service, in particular AWS Glue. I want to create a crawler to build an ETL pipeline and load data into a database in the AWS Glue Data Catalog, but I am stuck on a 400 access-denied error. I tried several things: - Changed the credit card and set it as the default - Added permissions many times; it still fails.
1
answers
0
votes
21
views
asked 6 days ago
I have cluster A and cluster B. Cluster A has an external schema called 'landing_external' that contains many tables from our Glue Data Catalog. Cluster A also has a local schema, called 'landing', made up of views that leverage data from 'landing_external'. Cluster A has a datashare that Cluster B consumes. The 'landing' schema is shared with Cluster B; however, anytime a user attempts to select data from any of the views in the 'landing' schema, they receive the error `ERROR: permission denied for schema landing_external`. I thought that creating all of the views with the option 'WITH NO SCHEMA BINDING' would address this permission gap, but it does not. Any ideas on what I am missing?
3
answers
0
votes
33
views
tjtoll
asked 6 days ago
I am trying to create an ETL job where I need to bring in data from Redshift tables, but the dataset is too large and I need to filter it before applying transformations. The Glue filter node and the SQL query option do not filter the data as required. The job keeps running for a long time and then fails, possibly due to the size of the data. It seems that Glue brings in all the data and then tries to apply the filter, but the job fails before the filter is applied. Is there a way to bring in only filtered data from Redshift and then apply transformations to it?
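One approach worth sketching: push the filter down to Redshift by reading through a JDBC query instead of the whole table, so only the filtered rows ever leave the cluster. This assumes the Redshift JDBC driver is available to the job, and the endpoint, credentials, table, and filter below are all placeholders:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders: substitute your own endpoint, credentials, and filter.
jdbc_url = "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev"
query = "SELECT order_id, amount, sale_date FROM sales.orders WHERE sale_date >= '2023-01-01'"

filtered_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("query", query)  # Redshift executes the filter; Spark only receives the result
    .option("user", "my_user")
    .option("password", "my_password")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .load()
)

# Apply Glue/Spark transformations to the already-filtered DataFrame.
filtered_df.show()
```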
1
answers
0
votes
24
views
aneeq10
asked 7 days ago
Hi there, I was adding a VPC network connection to an AWS Glue job and got this error: JobRunId:jr_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx failed to execute with exception Could not find connection for the given criteria Failed to get catalog connections given names: xxx-xxxxx-xxxx,none Service: AWSGlueJobExecutor; Status Code: 400; Error Code: InvalidInputException; Request ID: xxxxxxx-xxxxx-xxxxx-xxxxxxxxx; Proxy: null. I checked my VPC connection and it all looked fine, with the correct security settings. Eventually I realized that I had also added the "None" connection to the job, e.g. ![Enter image description here](/media/postImages/original/IMXWnSSNrzRNyltg8WKE_f1Q) Surely the "None" connection should be ignored, or else it shouldn't be selectable. Thanks, John
1
answers
0
votes
18
views
asked 8 days ago