Questions tagged with AWS Glue

Browse through the questions and answers listed below or filter and sort to narrow down your results.

I'm following the documentation at:

1. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-redshift-readwrite.html
2. https://github.com/spark-redshift-community/spark-redshift

My code: ![Code screenshot](/media/postImages/original/IMEbE35B_QRJSB3km3UK5VcQ) Logs: ![Log screenshot](/media/postImages/original/IMvorRP4RaSIa2_4CYIGmOww)

I keep getting these timeout messages until the job reaches its timeout threshold and fails. Is the IP in the log my internal Redshift Serverless address? Am I missing something? I would appreciate any help.
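A timeout against a private IP usually points at networking rather than the connector itself: the job must run in (or have a route to) the Redshift Serverless VPC, and the workgroup's security group must allow inbound traffic on port 5439 from the job's security group. For comparison, a minimal read with the community connector looks roughly like the sketch below; every endpoint, table name, and bucket is a placeholder.

```python
# Minimal sketch of a spark-redshift read, assuming the community connector
# (spark-redshift-community) is on the classpath. All names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-read").getOrCreate()

df = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    # Redshift Serverless endpoints look like:
    # <workgroup>.<account-id>.<region>.redshift-serverless.amazonaws.com
    .option("url", "jdbc:redshift://my-workgroup.123456789012.us-east-1.redshift-serverless.amazonaws.com:5439/dev")
    .option("dbtable", "public.my_table")
    # The connector stages data in S3; the job role needs access to this bucket.
    .option("tempdir", "s3a://my-temp-bucket/redshift-staging/")
    .option("forward_spark_s3_credentials", "true")
    .load()
)
df.show(5)
```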
1 answer · 0 votes · 35 views · asked a month ago
The WHERE clause doesn't work in the SQL transform in Glue. Any help is appreciated.
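The Glue Studio SQL transform runs Spark SQL against the alias given to the input node, so the same filter can be reproduced directly in a Glue script for debugging. A minimal sketch, assuming a catalog table and alias with placeholder names:

```python
# Rough script-level equivalent of a Glue Studio SQL transform with a WHERE
# clause. "myDataSource" mirrors the alias of the transform's input node.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)

# Register the input under the same alias the SQL transform would use.
dyf.toDF().createOrReplaceTempView("myDataSource")

# String literals need single quotes, and the column name must match the
# schema exactly -- both common reasons a WHERE clause silently matches nothing.
filtered = spark.sql("SELECT * FROM myDataSource WHERE status = 'active'")
result = DynamicFrame.fromDF(filtered, glue_context, "filtered")
```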
1 answer · 0 votes · 31 views · asked a month ago
Please tell me what this error means. I have granted the role what should be sufficient permissions, but I still can't create a crawler. ![Error screenshot](/media/postImages/original/IMiIhVakfdRiSCnoS0-xwbkQ)
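The screenshot isn't legible here, but a frequent cause is that the *caller* creating the crawler lacks `iam:PassRole` on the crawler's role, which fails even when the role itself has full Glue permissions. A sketch of the call being made under the hood, with placeholder names and ARNs:

```python
# Sketch: creating a crawler with boto3. Creating it requires glue:CreateCrawler
# plus iam:PassRole on the role passed below -- a commonly missed permission.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="my-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/my-prefix/"}]},
)
```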
1 answer · 0 votes · 21 views · asked a month ago
Hi, how can I get a notification in Amazon CloudWatch for failed Glue jobs in AWS?
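Glue publishes job state changes to EventBridge (formerly CloudWatch Events), so one common setup is a rule that matches failed runs and targets an SNS topic. A minimal sketch, assuming a pre-existing topic whose resource policy allows events.amazonaws.com to publish (the ARN below is a placeholder):

```python
# Sketch: EventBridge rule matching failed Glue job runs, targeting SNS.
import json
import boto3

events = boto3.client("events", region_name="us-east-1")

events.put_rule(
    Name="glue-job-failures",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED", "TIMEOUT"]},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="glue-job-failures",
    Targets=[{"Id": "notify-sns", "Arn": "arn:aws:sns:us-east-1:123456789012:glue-alerts"}],
)
```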
1 answer · 0 votes · 13 views · asked a month ago
While running a Glue job I see these arguments passed to it: `{'job_bookmark_option': 'job-bookmark-disable', 'job_bookmark_from': None, 'job_bookmark_to': None, 'JOB_ID': 'j_c8afc16edb1420c2fb878249843e27280db60efcd37b4f6c7c469c4a55a1b5bd', 'JOB_RUN_ID': 'jr_d74caf9a56f744d09ac4d7fd076caa3d8da3cbc5d58f925ea532dc3c7dfcdf32', 'SECURITY_CONFIGURATION': None, 'encryption_type': None, 'enable_data_lineage': None, 'RedshiftTempDir': 's3://aws-glue-assets-myaccount-us-east-1/temporary/', 'TempDir': 's3://aws-glue-assets-myaccount-us-east-1/temporary/', 'JOB_NAME': 'my-job'}`

I spotted a parameter called `enable_data_lineage`. For the next run I set it by adding `--enable-data-lineage true` in the Job parameters section. After this, my job's startup time jumped from 7 seconds to 3 minutes 10 seconds. I went to the logs to check what was going on and found error messages like this:

`2022-12-30 12:27:54,873 WARN [Thread-12] lineage.LineagePersistence$ (LineagePersistence.scala:isCatalogLineageSettingEnabled(99)): Exception occurred while getting catalog lineage settings, lineage for this job run will be disabled com.amazonaws.services.lakeformation.model.InternalServiceException: Received an unexpected Content-Type: text/html', expected one of [application/json]. HTTP status code:503 (Service: AWSLakeFormation; Status Code: 500; Error Code: InternalServiceException; Request ID: ffe50f30-ec28-4648-814a-e267be0453da; Proxy: null)`

I tried to search for documentation, but no luck. How do I set this feature up properly? Are there any examples?
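The lineage flag doesn't appear in the public Glue documentation, and the warning suggests the lookup against Lake Formation is what stalls startup. Independent of whether the feature is supported in your account, you can at least confirm exactly what the job run receives by dumping its arguments at the top of the script; a minimal sketch:

```python
# Sketch: print the raw and resolved arguments a Glue job run received,
# to verify how a custom flag like --enable-data-lineage reaches the script.
import sys
from awsglue.utils import getResolvedOptions

print("raw argv:", sys.argv)

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
print("resolved:", args)
```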
1 answer · 0 votes · 31 views · asked a month ago
We run our ETLs using the architecture below to populate the data lake:

MySQL -> DMS -> S3 -> Glue -> S3

Though this architecture works fine, it's heavily dependent on the database, and the object data is scattered across multiple tables. An ETL based on object data could be another way to retain the object structure and extract information from it. Below is what I am considering (a sketch of the producer side follows):

Application -> Kinesis Firehose -> S3 -> Glue -> S3

Has anyone tried this? Any pros/cons or architecture documentation would be helpful. Note: at this point we don't have any real-time data requirement, but we might in the future. Let me know if any other information is required.
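For reference, the application side of the proposed pipeline is typically just a Firehose `PutRecord` per event; Firehose buffers and writes the batches to S3, where Glue picks them up. A minimal sketch with a placeholder stream name and payload:

```python
# Sketch: pushing object events to a Kinesis Data Firehose delivery stream
# that buffers into S3. Stream name and payload shape are placeholders.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

event = {"order_id": "o-123", "status": "CREATED", "items": [{"sku": "abc", "qty": 2}]}

firehose.put_record(
    DeliveryStreamName="my-object-events",
    # Firehose concatenates records, so a trailing newline keeps the S3
    # objects line-delimited and easy for Glue to parse.
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```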
1 answer · 0 votes · 42 views · asked a month ago
Hello. By default, Glue runs one executor per worker, and I want to run more executors per worker. I set the following Spark configuration in the Glue job parameters, but it didn't work: `--conf spark.executor.instances=10`

Say I have 5 G.2X workers. In that case Glue starts 4 executors, because one worker is reserved for the driver, and I can see all 4 executors in the Spark UI. But the configuration above does not increase the executor count at all. I'm getting the following warning in the driver logs; it seems glue.ExecutorTaskManagement is controlling the number of executors:

`WARN [allocator] glue.ExecutorTaskManagement (Logging.scala:logWarning(69)): executor task creation failed for executor 5, restarting within 15 secs. restart reason: Executor task resource limit has been temporarily hit`

Any help would be appreciated. Thanks!
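The warning is consistent with Glue capping executors at what the provisioned workers can host: executor placement is managed by the service, so `spark.executor.instances` cannot exceed the worker count, and total parallelism is instead scaled through the number/type of workers. To see which executor settings the service actually applies over your `--conf` flags, a small sketch:

```python
# Sketch: print the executor-related Spark settings a Glue job actually
# runs with, to see what the service sets regardless of --conf overrides.
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
for key, value in sorted(sc.getConf().getAll()):
    if key.startswith("spark.executor") or key.startswith("spark.dynamicAllocation"):
        print(key, "=", value)
```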
1 answer · 0 votes · 18 views · asked a month ago
Hi there, I can only see two out of four columns in Athena. I don't know the reason, but could it be my Glue schema version? I tried changing the precision and scale numbers, but it didn't work. ![Athena output](/media/postImages/original/IMc0QRhlHeRai1GfeLDITGPQ) ![Glue schema](/media/postImages/original/IMwByedqIpQMqRteDHa0L6xw) ![SQL query](/media/postImages/original/IMtTq4UnkXRLCk5BsYFwN3yQ) ![CSV file (original data)](/media/postImages/original/IMGQWkeeRTRQapGVZJyc7Y0g) ![Expanded table](/media/postImages/original/IM0PzaGq-9Rrqyp8hJS16mEw) ![Glue ETL job](/media/postImages/original/IMHmQqzLBnTgSzVai6L6056g) I hope someone is able to help me out. Thanks in advance! - Ellie
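Athena takes its column list from the Glue Data Catalog table, so a good first check is whether the catalog definition actually contains all four columns from the CSV. A minimal sketch, with placeholder database/table names:

```python
# Sketch: read the column list Athena sees from the Glue Data Catalog and
# compare it against the four columns in the source CSV.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
```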
0 answers · 0 votes · 54 views · Ellie · asked a month ago
I am looking for a sample Glue streaming script that reads data from Kafka and writes to S3 in Iceberg format. I have gone through the AWS documentation but did not find enough detail to start the work. Note that I have to read from self-managed Kafka, not AWS-managed Kafka (MSK).
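As a starting point, here is a minimal sketch using Spark Structured Streaming, which works the same against self-managed Kafka as against MSK. It assumes the job has the Iceberg integration enabled (e.g. the `--datalake-formats iceberg` job parameter on Glue 4.0) and network access to the brokers; all bootstrap servers, topics, buckets, and table names are placeholders.

```python
# Sketch: Glue streaming job reading a self-managed Kafka topic and
# appending to an Iceberg table on S3 via Spark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder.appName("kafka-to-iceberg")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse/")
    .getOrCreate()
)

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "my-topic")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers raw bytes; cast to string here and parse/shape as needed.
events = raw.select(col("key").cast("string"), col("value").cast("string"))

query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/kafka-to-iceberg/")
    .toTable("glue_catalog.my_database.my_iceberg_table")
)
query.awaitTermination()
```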
0 answers · 0 votes · 17 views · asked a month ago
I am facing issues when I execute roughly 10-12 Glue DataBrew jobs at the same time. This is the error: `Too Many Requests (Service: AWSGlueDataBrew; Status Code: 429; Error Code: TooManyRequestsException)`

I found that one possible solution is to increase the maximum number of API calls and usage plans per API key. Is that correct? Would that change solve this issue?
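Usage plans and API keys apply to Amazon API Gateway, not to AWS service APIs like DataBrew; a 429 here is service-side throttling of concurrent API calls. The usual fixes are staggering the launches or letting the SDK back off and retry. A minimal sketch with placeholder job names:

```python
# Sketch: call DataBrew with an adaptive retry config so throttled (429)
# calls back off and retry instead of failing immediately.
import boto3
from botocore.config import Config

databrew = boto3.client(
    "databrew",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

for name in ["job-01", "job-02", "job-03"]:
    databrew.start_job_run(Name=name)
```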
1 answer · 0 votes · 30 views · Joel · asked a month ago
I would like to get data from an Iceberg table using AWS Lambda. I was able to create all the code and containers, only to discover that AWS Lambda doesn't allow the process substitution that Spark uses here: https://github.com/apache/spark/blob/121f9338cefbb1c800fabfea5152899a58176b00/bin/spark-class#L92

The error is: `/usr/local/lib/python3.10/dist-packages/pyspark/bin/spark-class: line 92: /dev/fd/63: No such file or directory`

Do you have any ideas how this could be solved?
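One workaround sometimes used is to patch `spark-class` at container build time so the launcher command is staged in a file under `/tmp` (which Lambda does allow) instead of a process substitution. The sketch below is a hypothetical build-time patch: the install path and the exact strings being replaced are assumptions based on the spark-class source linked above, so verify them against your pyspark version before relying on it.

```python
# Hypothetical build-time patch (e.g. run from a Dockerfile RUN step) that
# rewrites pyspark's spark-class to read build_command's output from /tmp
# instead of <(...), which fails in Lambda's environment.
from pathlib import Path

spark_class = Path("/usr/local/lib/python3.10/dist-packages/pyspark/bin/spark-class")
script = spark_class.read_text()

# Write the launcher command to a file before the read loop starts...
script = script.replace(
    'CMD=()',
    'build_command "$@" > /tmp/spark-command.txt\nCMD=()',
)
# ...then feed the loop from that file instead of the process substitution.
script = script.replace(
    'done < <(build_command "$@")',
    'done < /tmp/spark-command.txt',
)
spark_class.write_text(script)
```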
2 answers · 0 votes · 35 views · asked a month ago
Hello, everyone. We are experiencing issues when trying to query some tables in AWS Athena. We have an S3 bucket that we feed with Parquet files from the gold layer of our data lake, and the tables are crawled with AWS Glue. However, when we query some of these tables, certain Parquet files cause error messages like `GENERIC_INTERNAL_ERROR: io.trino.spi.type.DoubleType` or `GENERIC_INTERNAL_ERROR: io.trino.spi.type.VarcharType`

We have not been able to identify what is causing this, since the message is literally GENERIC. How can we solve this problem? We think it has something to do with the Parquet schema, but we're not sure which file. If someone could help clarify this, I'd be very glad. Thank you.
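A `GENERIC_INTERNAL_ERROR` naming a Trino type usually indicates a schema mismatch: some Parquet files store a column with a physical type (say, double) that disagrees with the type in the Glue table or in other files (say, string). One way to locate the offenders is to scan the Parquet footers under the table's prefix and group files by schema; a sketch with placeholder bucket/prefix:

```python
# Sketch: scan Parquet footers under a prefix and group files by schema,
# to find files whose column types disagree with the rest of the table.
import boto3
import pyarrow.parquet as pq
from pyarrow import fs

s3 = boto3.client("s3")
s3fs = fs.S3FileSystem(region="us-east-1")

bucket, prefix = "my-gold-bucket", "my_table/"
paginator = s3.get_paginator("list_objects_v2")

schemas = {}
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith(".parquet"):
            continue
        schema = pq.read_schema(f"{bucket}/{obj['Key']}", filesystem=s3fs)
        schemas.setdefault(str(schema), []).append(obj["Key"])

# More than one distinct schema means mixed column types across files.
for schema_text, keys in schemas.items():
    print(f"--- {len(keys)} file(s), e.g. {keys[0]}\n{schema_text}\n")
```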
0 answers · 0 votes · 29 views · asked a month ago