Questions tagged with AWS Glue

I wrote a Python shell job on AWS Glue and it is throwing an "Out of Memory Error". I added print() statements so I could see the output of the lines that execute successfully in the CloudWatch logs, but I cannot see that output anywhere in CloudWatch, neither in the error logs nor in the output logs. I also cannot tell from the error logs which line is causing the "Out of Memory Error". The only logs I can see are those for the installation of Python modules. I have run the Glue job multiple times but have never been able to see any of the above. Can someone help me out here?
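A small sketch of something worth trying, assuming the missing output is caused by stdout buffering in the Python shell job rather than by logs not being delivered at all: write progress messages through the logging module (or flush print explicitly) so they reach CloudWatch before the process is killed.

```
import logging
import sys

# Send log records to stdout so they appear in the job's CloudWatch output stream.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("starting step 1")  # shows up even if a later step runs out of memory

# Alternatively, force print() to flush immediately instead of buffering its output.
print("finished step 1", flush=True)
```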
1
answers
0
votes
11
views
akc_adi
asked 9 days ago
When I created a crawler to crawl an RDS (PostgreSQL) instance, it was able to connect and crawl the one table I specified. When I created a job using the node type "AWS Glue Data Catalog table with PostgreSQL as the data target" and pointed it at the database and table, it wouldn't connect to the target, giving me the error "An error occurred while calling o145.pyWriteDynamicFrame. The connection attempt failed." I've checked the security group and subnet of the RDS instance and the connection in Glue. What else should I be checking?
2
answers
0
votes
40
views
asked 10 days ago
Hello all, we need to build a small POC in which we pick up data from Salesforce and push it to Azure Data Lake using Glue. Can we connect to Azure Data Lake from Glue?
3
answers
0
votes
40
views
Purnima
asked 10 days ago
I configured the Glue job according to the [official document](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html), but it always throws the error below when running.

```
23/01/18 10:38:24 ERROR ProcessLauncher: Error from Python:Traceback (most recent call last):
  File "/tmp/test_job.py", line 16, in <module>
    AWSGlueDataCatalog_node1674017752048 = glueContext.create_dynamic_frame.from_catalog(
  File "/opt/amazon/lib/python3.7/site-packages/awsglue/dynamicframe.py", line 629, in from_catalog
    return self._glue_context.create_dynamic_frame_from_catalog(db, table_name, redshift_tmp_dir, transformation_ctx, push_down_predicate, additional_options, catalog_id, **kwargs)
  File "/opt/amazon/lib/python3.7/site-packages/awsglue/context.py", line 188, in create_dynamic_frame_from_catalog
    return source.getFrame(**kwargs)
  File "/opt/amazon/lib/python3.7/site-packages/awsglue/data_source.py", line 36, in getFrame
    jframe = self._jsource.getDynamicFrame()
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/lib/python3.7/site-packages/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o101.getDynamicFrame.
: java.lang.Exception: Unsupported dataframe format for job bookmarks
    at org.apache.spark.sql.wrapper.SparkSqlDecoratorDataSource.resolveRelation(SparkSqlDecoratorDataSource.scala:103)
    at com.amazonaws.services.glue.SparkSQLDataSource.$anonfun$getDynamicFrame$24(DataSource.scala:794)
    at com.amazonaws.services.glue.util.FileSchemeWrapper.$anonfun$executeWithQualifiedScheme$1(FileSchemeWrapper.scala:90)
    at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWith(FileSchemeWrapper.scala:83)
    at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWithQualifiedScheme(FileSchemeWrapper.scala:90)
    at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:762)
    at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:102)
    at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:102)
    at com.amazonaws.services.glue.AbstractSparkSQLDataSource.getDynamicFrame(DataSource.scala:726)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)
```

Script of the Glue job:

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1674017752048 = glueContext.create_dynamic_frame.from_catalog(
    database="source_db",
    table_name="source_table",
    transformation_ctx="AWSGlueDataCatalog_node1674017752048",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=AWSGlueDataCatalog_node1674017752048,
    mappings=[
        ("time", "timestamp", "time", "timestamp"),
        ("name", "string", "name", "string"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node MySQL table
MySQLtable_node3 = glueContext.write_dynamic_frame.from_catalog(
    frame=ApplyMapping_node2,
    database="target_db",
    table_name="target_table",
    transformation_ctx="MySQLtable_node3",
)

job.commit()
```

Source table definition:

```
CREATE TABLE source_db.source_table (
  time timestamp,
  name string)
PARTITIONED BY (`name`)
LOCATION 's3://source_db/source_table'
TBLPROPERTIES (
  'table_type'='iceberg'
);
```
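One possible workaround, assuming the failure comes from job bookmarks not supporting the Iceberg source (the transformation_ctx on the source node is what pulls in the bookmark code path): either disable job bookmarks for this job, or read the table through Spark SQL without a transformation_ctx, as in the sketch below. The `glue_catalog` prefix is an assumption about the Iceberg catalog name configured through the job's `--datalake-formats iceberg` / Spark conf settings from the linked document.

```
# Sketch: read the Iceberg source without a transformation_ctx so the
# job-bookmark code path is never invoked; the rest of the job stays unchanged.
from awsglue.dynamicframe import DynamicFrame

source_df = spark.sql("SELECT time, name FROM glue_catalog.source_db.source_table")
AWSGlueDataCatalog_node1674017752048 = DynamicFrame.fromDF(
    source_df, glueContext, "AWSGlueDataCatalog_node1674017752048"
)
```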
0
answers
0
votes
11
views
asked 10 days ago
We have a requirement to sync data from an on-premises database to AWS RDS (PostgreSQL) at specific intervals (unlike a one-time data migration). Assume there is an Interconnect/VPN already established between AWS and the on-premises network. The expected data volume is likely only about 1,000 rows, so I do not see the necessity of building ETL with AWS Glue. Given that, what are the possible solution options to fetch the data? Can AWS Batch or a pg_cron job be considered here to execute a set of SELECT and UPDATE SQL statements? Alternatively, if AWS Lambda is a solution option for this requirement, how do we trigger the Lambda at certain intervals? Appreciate your input.
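If Lambda ends up being the choice, a minimal sketch of the scheduling piece: an EventBridge rule with a rate expression invokes the function at a fixed interval. The rule name, interval, and function ARN below are placeholders.

```
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

rule_name = "sync-onprem-to-rds"  # placeholder
function_arn = "arn:aws:lambda:eu-west-1:123456789012:function:sync-job"  # placeholder

# Create (or update) a scheduled rule that fires every 30 minutes.
rule_arn = events.put_rule(
    Name=rule_name,
    ScheduleExpression="rate(30 minutes)",
    State="ENABLED",
)["RuleArn"]

# Allow EventBridge to invoke the function, then attach it as the rule's target.
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId="allow-eventbridge-sync-rule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)
events.put_targets(Rule=rule_name, Targets=[{"Id": "1", "Arn": function_arn}])
```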
1
answers
0
votes
24
views
asked 10 days ago
Hello team, is there a limit to the number of tables that can be scanned using the Glue crawler? I have a crawler that scans S3 buckets from a single source for data from January 2021 until December 2022, with partitions for year and month. The crawler is not updating the data for November and December 2022. I am using this data to query in Athena and eventually in QuickSight. Can anyone suggest what could be wrong?
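One quick check, assuming the November/December data sits under the same year/month prefix layout: load the partition metadata directly from Athena and see whether those months then appear; if they do, the problem is missing partitions in the catalog rather than a crawler limit. Database, table, and output location below are placeholders.

```
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_table",  # placeholder table name
    QueryExecutionContext={"Database": "my_database"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},  # placeholder bucket
)
```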
2
answers
0
votes
34
views
asked 10 days ago
I am getting the error "HIVE_UNKNOWN_ERROR: Path is not absolute: s3://datapipeline-youtube-json-cleaned-data-23" (with the notice "This query ran against the "" database, unless qualified by the query"). The data is present in the S3 path, but it is not getting queried.
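One assumption worth checking: this error often shows up when the table's LOCATION in the Glue Data Catalog is just the bare bucket name without a trailing slash or key prefix. A sketch of inspecting and correcting the location with boto3 (database and table names are placeholders):

```
import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]
print(table["StorageDescriptor"]["Location"])  # e.g. "s3://datapipeline-youtube-json-cleaned-data-23"

# Rewrite the location so it is a fully qualified path ending in "/".
table["StorageDescriptor"]["Location"] = "s3://datapipeline-youtube-json-cleaned-data-23/"

# UpdateTable only accepts TableInput fields, so copy over the relevant keys.
table_input = {
    key: table[key]
    for key in ("Name", "StorageDescriptor", "PartitionKeys", "TableType", "Parameters")
    if key in table
}
glue.update_table(DatabaseName="my_database", TableInput=table_input)
```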
2
answers
0
votes
40
views
asked 11 days ago
I have a Glue job of type "Ray" that was deployed using CDK. I'm using the following parameters for the job run: --enable-glue-datacatalog true, library-set analytics, --TempDir s3://{bucket}/temporary/, --additional-python-modules s3://{bucket}/{module}.zip. The job has a role which has access to the buckets for both TempDir and the additional Python modules. When looking at the logs in CloudWatch, I can see that the job does everything it's supposed to do, but in the console the job fails with the error "Error while writing result to S3 working directory". I can't find any details in any of the log groups.
2
answers
0
votes
15
views
asked 12 days ago
I'd like to run evaluations of my data quality rulesets on single partitions of my table rather than the whole table. This is because for most of my tables each partition effectively represents a snapshot of the data and running the checks only makes sense in the context of a single partition. Is there a way to filter or subset the data that a ruleset is evaluated on? Preferably I'd like to do this when triggering the evaluation, but defining a restriction in the DQDL rules might also work.
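A sketch of one way to do this from a Glue ETL job, assuming the ruleset is evaluated with the EvaluateDataQuality transform rather than from the console: push a partition predicate down when the DynamicFrame is created so only that partition is evaluated. Database, table, partition columns, and rules below are placeholders.

```
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality

glueContext = GlueContext(SparkContext.getOrCreate())

# Only the requested partition is read, thanks to the push-down predicate.
single_partition = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
    push_down_predicate="year = '2023' AND month = '01'",
)

ruleset = """Rules = [ RowCount > 0, IsComplete "id" ]"""

results = EvaluateDataQuality.apply(
    frame=single_partition,
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "single_partition_check"},
)
results.toDF().show()
```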
0
answers
0
votes
23
views
asked 12 days ago
We're experimenting with Glue 4.0 features and facing a few issues. When we contacted AWS Support for troubleshooting a few days ago, we were informed that it is still in preview and were asked to switch to Glue 3.0 if possible. Is this true?
1
answers
0
votes
25
views
asked 12 days ago
I have a Glue job where I'm creating a DynamicFrame from the Glue catalog. I am getting an intermittent error with o343.getDynamicFrame, but if I rerun the job it succeeds. Exception: job failed due to error - An error occurred while calling o343.getDynamicFrame. The command it fails on:

```
source_dynamic_df = glueContext.create_dynamic_frame.from_catalog(
    database=src_catalog_db,
    table_name=src_tbl_nm,
    push_down_predicate=partition_predicate,
    additional_options={"mergeSchema": "true"},
    transformation_ctx="source_dynamic_df",
)
```
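A workaround sketch, assuming the failure is transient (throttling or a momentary hiccup rather than a schema problem): wrap the frame creation in a small retry loop so a one-off failure does not fail the whole run. It reuses the variables from the snippet above.

```
import time

def create_source_frame(retries=3, backoff_seconds=30):
    """Retry the DynamicFrame creation a few times before giving up."""
    for attempt in range(1, retries + 1):
        try:
            return glueContext.create_dynamic_frame.from_catalog(
                database=src_catalog_db,
                table_name=src_tbl_nm,
                push_down_predicate=partition_predicate,
                additional_options={"mergeSchema": "true"},
                transformation_ctx="source_dynamic_df",
            )
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff_seconds * attempt)

source_dynamic_df = create_source_frame()
```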
1
answers
0
votes
15
views
Amanda
asked 13 days ago
Hi all, I have followed the instructions at https://docs.aws.amazon.com/athena/latest/ug/connect-data-source-serverless-app-repo.html to deploy Timestream as an additional data source for Athena, and I can successfully query Timestream data via the Athena console using the catalog "TimestreamCatalog" I added. Now I need to use the same catalog "TimestreamCatalog" when building a Glue job. I run:

```
DataCatalogtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    catalog_id="TimestreamCatalog",
    database="mydb",
    table_name="mytable",
    transformation_ctx="DataCatalogtable_node1",
)
```

and run into the error below, even when the role in question has an administrator policy, i.e. action:* resource:*, attached (for the sake of the experiment):

```
An error occurred while calling o86.getCatalogSource. User: arn:aws:sts::*******:assumed-role/AWSGlueServiceRole-andrei/GlueJobRunnerSession is not authorized to perform: glue:GetTable on resource: arn:aws:glue:eu-central-1:TimestreamCatalog:catalog (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: 36d7e411-8ca9-4993-9066-b6ca1d7ea4a3; Proxy: null)
```

When calling `aws athena list-data-catalogs`, I get:

```
{
    "DataCatalogsSummary": [
        {
            "CatalogName": "AwsDataCatalog",
            "Type": "GLUE"
        },
        {
            "CatalogName": "TimestreamCatalog",
            "Type": "LAMBDA"
        }
    ]
}
```

I am not sure if using the data source name as catalog_id is correct here, so any hint on what catalog_id is supposed to be for a custom data source is appreciated, as is any hint on how to resolve the issue above. Thanks, Andrei
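A possible workaround sketch, under the assumption that catalog_id in create_dynamic_frame.from_catalog must be a Glue Data Catalog ID (an AWS account ID) and cannot point at an Athena federated LAMBDA catalog: run the query through the Athena API from inside the Glue job and read the result file back with Spark. The output location is a placeholder, and `spark` is the job's existing Spark session.

```
import time
import boto3

athena = boto3.client("athena", region_name="eu-central-1")

query_id = athena.start_query_execution(
    QueryString='SELECT * FROM "mydb"."mytable"',
    QueryExecutionContext={"Catalog": "TimestreamCatalog", "Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/glue/"},  # placeholder
)["QueryExecutionId"]

# Poll until the query finishes, then load the CSV result Athena wrote to S3.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
    if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

if status["State"] == "SUCCEEDED":
    df = spark.read.option("header", "true").csv(f"s3://my-athena-results/glue/{query_id}.csv")
```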
1
answers
0
votes
26
views
asked 13 days ago