Questions tagged with AWS Glue


Hello - I have a Glue job that reads data from a Glue Catalog table and writes it back into S3 in Delta format. The IAM role of the Glue job has s3:PutObject, List, Describe, and all the other permissions needed to read from and write to S3. However, I keep running into this error:

```
2022-12-14 13:48:09,274 ERROR [Thread-9] output.FileOutputCommitter (FileOutputCommitter.java:setupJob(360)): Mkdirs failed to create glue-d-xxx-data-catalog-t-<dataset-name>-m-w://<s3-prefix>/_temporary/0
2022-12-14 13:48:13,875 WARN [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logWarning(73)): Lost task 5.0 in stage 1.0 (TID 6) (172.34.113.239 executor 2): java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: HG7ST1B44A6G30JC; S3 Extended Request ID: tR1CgoC1RcXZHEEZZ1DuDOvwIAmqC0+flXRd1ccdsY3C8PyjkEpS4wHDaosFoKpRskfH1Del/NA=; Proxy: null)
```

This error does not appear when I open up S3 bucket access with a wildcard principal (*) in the bucket policy. The job fails even if I set the principal to the same role the Glue job is associated with. My question is: does AWS Glue assume a different identity to run the job? The IAM role associated with the job has all the permissions needed to interact with S3, yet it throws the AccessDenied exception above (failed to create directory), while the job succeeds with a wildcard (*) in the bucket permissions.

To add some more context: this error does not happen when I use native Glue constructs like dynamic frames or Spark data frames to read, process, and persist data into S3. It only happens with Delta format. Below is the sample code:

```
src_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="<db_name>",
    table_name="<table_name_glue_catalog>"
)
dset_df = src_dyf.toDF()  # dynamic frame to data frame conversion

# write the data frame into an s3 prefix in delta format
glueContext.write_data_frame_from_catalog(
    frame=dset_df,
    database="xxx_data_catalog",
    table_name="<table_name>",
    additional_options=additional_options  # contains the s3 path as a key-value pair
)
```
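A frequent cause of this pattern (works with a wildcard principal, fails with the role) is a bucket policy that grants object-level actions but not the bucket-level ones the output committer needs when it creates and lists the `_temporary/` prefix. A minimal sketch of a policy that covers both, written with boto3; the bucket name and role ARN are hypothetical placeholders, not values from the question:

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical names -- replace with the real bucket and the Glue job's role ARN.
BUCKET = "my-delta-output-bucket"
GLUE_ROLE_ARN = "arn:aws:iam::111122223333:role/my-glue-job-role"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Bucket-level actions (ListBucket, GetBucketLocation) target the bucket ARN itself.
            "Sid": "AllowGlueRoleBucketLevel",
            "Effect": "Allow",
            "Principal": {"AWS": GLUE_ROLE_ARN},
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            # Object-level actions target bucket/* so the committer can create and
            # clean up objects under the _temporary/ prefix.
            "Sid": "AllowGlueRoleObjectLevel",
            "Effect": "Allow",
            "Principal": {"AWS": GLUE_ROLE_ARN},
            "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```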
1
answers
0
votes
40
views
asked a month ago
Using:

```
UNLOAD (SELECT * FROM TABLE)
TO 's3://...'
WITH (format = 'PARQUET', compression = 'SNAPPY');
```

there is a column in the table which is an array, and the query comes back with `Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead`. How do I unload to Parquet if the column has arrays?
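This Parquet writer error is typically triggered by rows where the array is empty. One workaround, offered here only as a sketch and not a confirmed fix, is to cast the array column to JSON before unloading; the database, table, column, and bucket names below are hypothetical:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical names -- replace with the real database, table, column, and bucket.
query = """
UNLOAD (
    SELECT id, CAST(my_array_col AS JSON) AS my_array_col
    FROM my_table
)
TO 's3://my-bucket/unload/'
WITH (format = 'PARQUET', compression = 'SNAPPY')
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```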
0
answers
0
votes
26
views
asked 2 months ago
I ran a basic SQL query in Athena to view the catalog table created by the Glue crawler (the crawler job ended successfully and created the "metadata" catalog table in the "hw-db" database):

`SELECT * FROM "AwsDataCatalog"."hw-db"."metadata" limit 10;`

and got the following error:

```
HIVE_UNKNOWN_ERROR: com.amazonaws.services.lakeformation.model.InvalidInputException: Unsupported vendor for Glue supported principal: arn:aws:iam::{...}:root (Service: AWSLakeFormation; Status Code: 400; Error Code: InvalidInputException; Request ID: {...}; Proxy: null)

This query ran against the "hw-db" database, unless qualified by the query.
```

Any ideas?
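The message suggests Lake Formation is being asked to authorize the account root user, which it does not accept as a principal. A common approach, stated here as an assumption rather than a confirmed fix, is to run the query as an IAM role or user and grant that principal SELECT on the table in Lake Formation. A sketch with a hypothetical role ARN:

```python
import boto3

lf = boto3.client("lakeformation")

# Hypothetical principal -- use the IAM role or user that actually runs the Athena
# query, not the account root.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/athena-query-role"
    },
    Resource={"Table": {"DatabaseName": "hw-db", "Name": "metadata"}},
    Permissions=["SELECT"],
)
```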
1
answers
0
votes
61
views
Erez
asked 2 months ago
I have a simple query in Athena: select the average price, grouped by country. It is returning 4 rows: 1 for Canada, 1 for the USA, and then ***2 rows with dates in the country column***. This seems to be a bug. What can be done to understand what is going wrong? I am completely new to Athena. Athena is querying a database in Glue that has catalogued the metadata of a CSV file in S3 with its crawler. I have downloaded the CSV from S3, and it looks fine.
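A diagnostic sketch (the database, table, column, and bucket names are hypothetical): pull the raw rows whose country value is unexpected, which with crawled CSV data usually exposes shifted columns caused by unquoted commas or a stray header row:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical names -- adjust to match the crawled CSV table.
athena.start_query_execution(
    QueryString="""
        SELECT *
        FROM my_table
        WHERE country NOT IN ('Canada', 'USA')
        LIMIT 20
    """,
    QueryExecutionContext={"Database": "my_glue_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```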
Accepted Answer
Amazon Athena
AWS Glue
1
answers
0
votes
40
views
asked 2 months ago
Hi, we want to migrate our Informatica jobs to AWS Glue. Is there a tool or process to migrate the jobs?
1
answers
0
votes
27
views
asked 2 months ago
I am trying to use AWS Glue 4.0 with interactive sessions using the following magic configuration:

```
%glue_version 4.0
```

But this throws an error that only 2.0 and 3.0 are valid versions. Clearly this is not supported yet, but I wanted to check whether there is anything one needs to do to enable Glue 4.0 for interactive sessions.
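Nothing in the question confirms a workaround, but one way to check whether your account and region accept Glue 4.0 sessions independently of the notebook magics is to call the CreateSession API directly; the session id and role ARN below are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical session id and role -- replace with your own.
response = glue.create_session(
    Id="glue40-test-session",
    Role="arn:aws:iam::111122223333:role/my-glue-interactive-role",
    Command={"Name": "glueetl", "PythonVersion": "3"},
    GlueVersion="4.0",
)
print(response["Session"]["Status"])
```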
1
answers
0
votes
26
views
RGCK
asked 2 months ago
Greetings, I have a really simple ETL job that should take a CSV from S3 and insert it into Redshift. However, I can't configure Redshift as a target because, for some reason, the target properties dropdown only shows Glue Data Catalog databases and not Redshift ones. I have tried different browsers thinking it was a caching issue, but am now convinced it's an AWS error. ![Enter image description here](/media/postImages/original/IMYaKDKhxNSo-WlE2oBiZWZQ)
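As a possible workaround while the visual editor does not offer the Redshift target, a job script can still write to Redshift through an existing Glue JDBC connection. A sketch, assuming a hypothetical connection name, catalog table, and temporary S3 path:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Hypothetical source: a catalog table crawled from the CSV in S3.
src_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_csv_table"
)

# Write to Redshift through an existing Glue connection named "my-redshift-connection";
# redshift_tmp_dir is required because the connector stages data in S3 for COPY.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=src_dyf,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "public.my_table", "database": "dev"},
    redshift_tmp_dir="s3://my-bucket/redshift-temp/",
)
```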
1
answers
0
votes
28
views
asked 2 months ago
Hello everyone! I am trying to import data from an RDS SQL Server instance into Redshift, but I have a query with a recursive CTE like this:

```
WITH RECURSIVE table(n) AS (
    SELECT 1
    UNION ALL
    SELECT n+1 FROM table WHERE n < 100
)
SELECT n FROM table;
```

On Redshift there is no error, but if I execute the query in an AWS Glue job (Transform - SQL Query) I get this error:

*ParseException: no viable alternative at input 'WITH RECURSIVE table*

What am I missing?
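The Glue SQL transform runs Spark SQL, which does not support `WITH RECURSIVE`, so the recursion has to be rewritten. For the 1-to-100 sequence in the question, a minimal PySpark sketch (assuming a standard Glue Spark job context) could be:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Equivalent of the recursive CTE above: the integers 1..100 as a single-column frame.
df = spark.range(1, 101).withColumnRenamed("id", "n")
df.createOrReplaceTempView("numbers")

spark.sql("SELECT n FROM numbers").show()
```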
2
answers
0
votes
54
views
asked 2 months ago
We are receiving a java.io.CharConversionException when trying to read data from a DB2 database that has characters outside the UTF-8 encoding. I tried adding an option("encoding", "ISO-8859-1") and an option("charset", "ISO-8859-1"), but neither seems to have any effect. Is it possible to ask the glueContext to use a specific type of encoding? If not, what options do we have for handling characters that throw the CharConversionException? We have been excluding the rows through SQL, but this is not a tenable solution.
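One option, stated as an assumption rather than a confirmed fix, is the IBM JDBC driver property `db2.jcc.charsetDecoderEncoder=3`, which replaces undecodable characters instead of raising CharConversionException. A sketch that reads over plain Spark JDBC with hypothetical host, schema, and credentials (the DB2 driver jar still has to be supplied to the job, for example via `--extra-jars`):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Hypothetical host/db/credentials. The trailing URL property asks the IBM driver to
# substitute characters it cannot decode rather than throwing CharConversionException.
url = "jdbc:db2://my-db2-host:50000/MYDB:db2.jcc.charsetDecoderEncoder=3;"

df = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "MYSCHEMA.MYTABLE")
    .option("user", "my_user")
    .option("password", "my_password")
    .load()
)
```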
3
answers
0
votes
50
views
asked 2 months ago
Is OLE Automation supported in SQL Server RDS? If not, what are the alternatives? Thanks in advance.
1
answers
0
votes
54
views
KH
asked 2 months ago
I have been experimenting with Glue 4.0, which supports Python 3.10 and pandas. I am adding pandas as a zipped library through the `--extra-py-files` functionality for a `glueetl` job. When running my job, it fails importing pandas (version 1.4.3) (`import pandas as pd`) with the following, which I copy-pasted from the CloudWatch logs:

```
2022-12-06 16:49:09,450 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): Error from Python:Traceback (most recent call last):
  File "/tmp/database_monitoring.py", line 2, in <module>
    import pandas as pd
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/__init__.py", line 48, in <module>
    from pandas.core.api import (
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/api.py", line 47, in <module>
    from pandas.core.groupby import (
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/groupby/__init__.py", line 1, in <module>
    from pandas.core.groupby.generic import (
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 76, in <module>
    from pandas.core.frame import DataFrame
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 170, in <module>
    from pandas.core.generic import NDFrame
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 147, in <module>
    from pandas.core.describe import describe_ndframe
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/core/describe.py", line 45, in <module>
    from pandas.io.formats.format import format_percentiles
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/io/formats/format.py", line 105, in <module>
    from pandas.io.common import (
  File "/home/spark/.local/lib/python3.10/site-packages/pandas/io/common.py", line 8, in <module>
    import bz2
  File "/usr/local/lib/python3.10/bz2.py", line 17, in <module>
    from _bz2 import BZ2Compressor, BZ2Decompressor
ModuleNotFoundError: No module named '_bz2'
```

(The same traceback is logged twice.) I believe this is a bug in AWS Glue 4.0 as opposed to a user issue. Is anyone able to advise or confirm? And if so, is there a bug fix planned for this?
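Shipping pandas as a zip cannot include compiled extension modules such as `_bz2`, so one likely workaround, offered as an assumption rather than a confirmed fix, is to drop the zip and rely on the pandas that Glue 4.0 bundles, or to have Glue pip-install a specific version through `--additional-python-modules`. A sketch using boto3 with hypothetical job, role, and script names:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical job name, role, and script location. Instead of shipping pandas as a
# zip (which leaves out compiled pieces like _bz2), ask Glue to pip-install it.
# Note: Glue 4.0 already bundles pandas, so removing the zip alone may be enough.
glue.create_job(
    Name="my-glue40-job",
    Role="arn:aws:iam::111122223333:role/my-glue-job-role",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/database_monitoring.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
    DefaultArguments={"--additional-python-modules": "pandas==1.4.3"},
)
```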
2
answers
0
votes
47
views
JDay
asked 2 months ago
I have an include path like this one: s3://my-datalake/projects/. In this projects folder, I have these folders: daily-2022-11-05, daily-2022-11-06, incremental_123456, and incremental_234567. Each of these folders contains a parquet file. Now, when the crawler runs, I want it to exclude everything whose name starts with incremental_. I did try using `incremental_**/**`. This works for one crawler but not for the other. By "not working" I mean that when I run the crawler, it either doesn't update the table or it fails.
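For reference, a sketch of how the exclusion can be set on the crawler with boto3; the crawler name is hypothetical, and the glob `incremental_*/**` (a single asterisk for the folder name) is an assumption rather than a confirmed fix:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical crawler name. The exclusion glob is evaluated relative to the include
# path, so "incremental_*/**" skips every object under folders whose name starts
# with "incremental_".
glue.update_crawler(
    Name="my-projects-crawler",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://my-datalake/projects/",
                "Exclusions": ["incremental_*/**"],
            }
        ]
    },
)
```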
1
answers
0
votes
15
views
asked 2 months ago