Questions tagged with AWS Glue
I am running a Glue job to fetch records from Microsoft SQL Server, but the job keeps running and does not show any results. The job is scheduled with G.2X workers (5 workers) with auto scaling.
Logs:
23/02/27 09:02:45 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I am trying to join two tables in CSV format (saved in an S3 bucket).
The target is also an empty folder in S3.
Every time I get the following error:
**AnalysisException: Cannot resolve column name "device_id" among ()**
where **device_id** is the unique ID used for joining the tables.
Please help.
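For context, "among ()" means the frame being joined has no columns at all, which usually points to an empty or wrong source path, or to the CSV header/schema never being picked up, rather than to the join itself. A minimal sketch of the kind of join being attempted (all paths below are placeholders, not taken from the question):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read both CSV tables with headers so Spark picks up real column names;
# an empty column list in the error ("among ()") usually means one side
# was read from an empty or incorrect path.
devices = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("s3://my-bucket/input/devices/"))     # placeholder path
readings = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("s3://my-bucket/input/readings/"))   # placeholder path

# Confirm the columns were actually detected before joining.
print(devices.columns)
print(readings.columns)

joined = devices.join(readings, on="device_id", how="inner")
joined.write.mode("overwrite").csv("s3://my-bucket/output/joined/")   # placeholder path
```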
I have programmatically defined an EventBridge rule to send an event when a crawler completes.
```
response = event_client.put_rule(
    Name="newmyrule",
    EventPattern='{"detail-type": ["Glue Crawler State Change"],"source": ["aws.glue"],"detail": {"crawlerName":["'+crawler_name+'"],"state": ["Succeeded"]}}'
)
print("put_rule=" + str(response))

put_target_response = event_client.put_targets(
    Rule='newmyrule',
    Targets=[{
        'Id': 'mylambdafn',
        'Arn': 'arn:aws:lambda:us-west-1:xxxxxxxxxxxxx:function:mylambdafn'
    }]
)

enable_rule_response = event_client.enable_rule(Name='newmyrule')
```
I have also defined the crawler through boto3.
```
create_crawler_response = glue.create_crawler(
    Name=crawler_name,
    Role='arn:aws:iam::xxxxxxxxxx:role/ravi-glue-access',
    DatabaseName='noah-ingest',
    # TablePrefix="",
    Targets={'S3Targets': [{'Path': s3_target}]},
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'DELETE_FROM_DATABASE'
    }
)
```
It looks similar to rules defined through the console, but it results in FailedInvocations. How do I fix this?
thanks,
Ravi.
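One thing worth checking, not shown in the snippets above: a rule created in the EventBridge console automatically adds a resource-based permission to the target Lambda function, but put_targets alone does not, and a missing permission surfaces as FailedInvocations. A hedged sketch of adding that permission (the statement ID is a hypothetical name):
```
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Look up the rule's ARN, then allow EventBridge to invoke the target function.
# 'newmyrule-invoke' is a hypothetical statement ID.
rule_arn = events.describe_rule(Name='newmyrule')['Arn']
lambda_client.add_permission(
    FunctionName='mylambdafn',
    StatementId='newmyrule-invoke',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn
)
```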
I want to join two tables. I have the tables in CSV format stored in an S3 bucket.
1. Is AWS Glue Studio the right option?
2. What is the correct procedure?
3. What are the IAM permissions required?
4. Where can I see the joined table output?
Please throw some light.
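For reference, Glue Studio can build this visually (two S3 sources, a Join transform, and an S3 target), and the script it generates is roughly equivalent to the hand-written sketch below. Bucket paths and the join column are placeholders, and the job role needs S3 read/write on those paths plus the usual Glue service permissions (for example the AWSGlueServiceRole managed policy):
```
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Join

glueContext = GlueContext(SparkContext.getOrCreate())

# Read both CSV tables from S3 (placeholder paths); withHeader uses the
# first row as column names.
left = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/table_a/"]},
    format="csv",
    format_options={"withHeader": True},
)
right = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/table_b/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Join on a shared key column (placeholder name) and write the result to S3,
# which is where the joined output can then be inspected.
joined = Join.apply(left, right, "id", "id")
glueContext.write_dynamic_frame.from_options(
    joined,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/joined_output/"},
    format="csv",
)
```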
I have a working lab setup that has a Glue job extract all data from a single DynamoDB table to S3 in JSON format. This was done with the super simple setup using the AWS Glue DynamoDB connector, all through the Glue visual editor. I plan to run the job daily to refresh the data. The job is set up with Glue 3.0 & Python 3. Two questions:
1. I assume I need to purge/delete the S3 objects from the previous ETL job each night - how is this done within Glue, or do I need to handle it outside of Glue?
2. I would like to update that job to limit the data sent to S3 to only include DynamoDB records that have a specific key/value (status <> 'completed'), so that I am not loading all of the DynamoDB data into my target. I don't care if the job has to get ALL of the DynamoDB table during extract and then filter it out during the transform phase; if there is a way to selectively get data during the extract phase, even better.
If anyone could advise with a simple example, I would appreciate it. While I have looked for a little bit, I haven't found much quality educational material, so I am happy to take any suggestions there as well (other than the AWS documentation - I have that, but need some initial direction/reference/101 hands-on).
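A hedged sketch of both pieces, assuming the visual job is switched to (or extended with) script mode: GlueContext.purge_s3_path clears the previous run's output from inside the job itself, and a Filter transform drops the completed records after the DynamoDB read. The table name and bucket path are placeholders.
```
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Filter

glueContext = GlueContext(SparkContext.getOrCreate())

# 1. Purge last night's output before writing the fresh extract
#    (retentionPeriod=0 removes objects regardless of age).
glueContext.purge_s3_path(
    "s3://my-bucket/dynamodb-export/",            # placeholder output path
    options={"retentionPeriod": 0},
)

# 2. Read the full DynamoDB table, then keep only non-completed records.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={"dynamodb.input.tableName": "my_table"},  # placeholder table
)
not_completed = Filter.apply(frame=dyf, f=lambda row: row["status"] != "completed")

glueContext.write_dynamic_frame.from_options(
    not_completed,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/dynamodb-export/"},
    format="json",
)
```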
Hey, I am running into this error during a data ingest job. The job has worked in the past with many different files, but this one refuses to ingest.
I am curious if anyone has overcome it. The error seems like some sort of threading issue where it can't write data.
CloudWatch logs:
```
2023-02-23 19:13:01,888 ERROR [shutdown-hook-0] util.Utils (Logging.scala:logError(94)): Uncaught exception in thread shutdown-hook-0
java.lang.ExceptionInInitializerError
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createAndTrack(TemporaryDirectoriesGenerator.java:125)
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createTemporaryDirectories(TemporaryDirectoriesGenerator.java:149)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:356)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1105)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:994)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2424)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2390)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2353)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.copyFromLocalFile(EmrFileSystem.java:568)
at com.amazonaws.services.glue.LogPusher.upload(LogPusher.scala:27)
at org.apache.spark.util.ShutdownHookManagerWrapper$.$anonfun$addLogPusherHook$2(ShutdownHookManagerWrapper.scala:9)
at org.apache.spark.util.ShutdownHookManagerWrapper$.$anonfun$addLogPusherHook$2$adapted(ShutdownHookManagerWrapper.scala:9)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.util.ShutdownHookManagerWrapper$.$anonfun$addLogPusherHook$1(ShutdownHookManagerWrapper.scala:9)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at scala.util.Try$.apply(Try.scala:209)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalStateException: Shutdown in progress
at java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:66)
at java.lang.Runtime.addShutdownHook(Runtime.java:203)
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoryShutdownHook.<clinit>(TemporaryDirectoryShutdownHook.java:18)
... 31 more
```
Hi,
I am using Glue ETL with Spark 3.0 and Python.
The ETL job has only 2 steps. I am using CodeGenConfiguration to auto-create the Spark script from my service backend.
```
"{\"sink-node-1\":{\"nodeId\":\"sink-node-1\",\"dataPreview\":false,\"previewAmount\":0,\"inputs\":[\"source-node-1\"],\"name\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=62789a5b-78dd-4d41-ae96-1447674861a6__type=GDL\",\"generatedNodeName\":\"organization_id47e04d2824d347e79911bbfc071c754e__id62789a5b78dd4d41ae961447674861a6__typeGDL_sinknode1\",\"classification\":\"DataSink\",\"type\":\"S3\",\"streamingBatchInterval\":100,\"format\":\"parquet\",\"compression\":\"snappy\",\"path\":\"s3://x-bucket/event_etl_data/source_id=glueetl/schema_id=etl_raw_event/pipeline_id=fcf172f2-1cd1-4f9d-bdce-62b3b0c26696/organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e/model_name=test-sql-with-database-namez__version=None/\",\"partitionKeys\":[[\"year\"],[\"month\"],[\"day\"],[\"hour\"]],\"schemaChangePolicy\":{\"enableUpdateCatalog\":false,\"updateBehavior\":null,\"database\":null,\"table\":null},\"updateCatalogOptions\":\"none\",\"calculatedType\":\"\"},\"source-node-1\":{\"nodeId\":\"source-node-1\",\"dataPreview\":false,\"previewAmount\":0,\"inputs\":[],\"name\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=9cf97c7b-ced8-4096-a7c3-2ca3560e0fd0__type=SNOWFLAKE\",\"generatedNodeName\":\"organization_id47e04d2824d347e79911bbfc071c754e__id9cf97c7bced84096a7c32ca3560e0fd0__typeSNOWFLAKE_sourcenode1\",\"classification\":\"DataSource\",\"type\":\"Connector\",\"isCatalog\":false,\"connectorName\":\"SNOWFLAKE\",\"connectionName\":\"organization_id=47e04d28-24d3-47e7-9911-bbfc071c754e__id=9cf97c7b-ced8-4096-a7c3-2ca3560e0fd0__type=SNOWFLAKE\",\"connectionType\":\"custom.jdbc\",\"outputSchemas\":[],\"connectionTable\":null,\"query\":\"SELECT \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"A\\\" AS \\\"inputs__A\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"B\\\" AS \\\"inputs__B\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"C\\\" AS \\\"outputs__C\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"ID\\\" AS \\\"feedback_id\\\", \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\".\\\"D\\\" AS \\\"timestamp\\\", year(SYSDATE()) AS \\\"year\\\", month(SYSDATE()) AS \\\"month\\\", day(SYSDATE()) AS \\\"day\\\", hour(SYSDATE()) AS \\\"hour\\\", SYSDATE() AS \\\"log_timestamp\\\" FROM \\\"ETL_DEMO\\\".\\\"PUBLIC\\\".\\\"EXAMPLE_TABLE\\\" ORDER BY \\\"D\\\"\",\"additionalOptions\":{\"filterPredicate\":\"\",\"partitionColumn\":null,\"lowerBound\":null,\"upperBound\":null,\"numPartitions\":null,\"jobBookmarkKeys\":[],\"jobBookmarkKeysSortOrder\":\"ASC\",\"dataTypeMapping\":{},\"filterPredicateArg\":[],\"dataTypeMappingArg\":[]},\"calculatedType\":\"\"}}"
```

As you can see, I am using the Snowflake JDBC connector and simply using S3DirectTarget to write the Parquet files to the S3 destination. However, any NULL values in numeric columns of the source table end up as 0.0, and there is no way for me to tell whether these are actual 0.0s or falsely converted 0.0s. Without modifying the PySpark script (since my backend service depends on CodeGenConfiguration), is there a way to make sure the NULL values do not get falsely converted?
Thanks,
Kyle
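As a diagnostic only (not a fix), the written Parquet can be read back to count the suspicious zeros and compare them against a NULL count run directly in Snowflake; the path below is a shortened placeholder for the sink path in the configuration above.
```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read back what the S3 sink wrote (placeholder prefix of the sink path above).
out = spark.read.parquet("s3://x-bucket/event_etl_data/source_id=glueetl/")

# Count 0.0 values per floating-point/decimal column; comparing these counts
# with SELECT COUNT_IF(col IS NULL) run in Snowflake shows whether the zeros
# were originally NULLs.
numeric_cols = [f.name for f in out.schema.fields
                if f.dataType.typeName() in ("double", "float", "decimal")]
out.select([F.count(F.when(F.col(c) == 0.0, 1)).alias(c) for c in numeric_cols]).show()
```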
Hi,
I built an Iceberg table that uses Glue as the Hive catalog. Team members I work with want to connect to it using Spark. They either run Spark locally on their laptops and want to read the table, or they have Spark running in an Airflow task on an EC2 instance and want to connect to it.
Is it possible to configure Spark that is not running on Glue or EMR to connect to Glue as the Hive metastore? If so, some examples would be appreciated.
We set this conf when running Iceberg: "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory".
Is this a JAR I can add to any Spark application so that it can connect to AWS Glue as the Hive metastore, or does it only work on EMR?
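For Iceberg tables specifically, there is also a route that does not rely on the EMR-only Hive client factory: Iceberg's own GlueCatalog implementation, configured purely through Spark confs and the Iceberg AWS jars. A minimal local sketch, where the package versions, catalog name, warehouse path, and table names are all assumptions (credentials come from the normal AWS credential chain):
```
from pyspark.sql import SparkSession

# Package versions below are assumptions; match them to your Spark/Iceberg setup.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3,"
            "org.apache.iceberg:iceberg-aws-bundle:1.4.3")
    # Register a Spark catalog backed by the AWS Glue Data Catalog.
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-warehouse-bucket/")  # placeholder
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Read an Iceberg table registered in a Glue database (placeholder names).
df = spark.table("glue.analytics.my_iceberg_table")
df.show()
```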
In my case, when I query information_schema.columns in Athena, the result does not include the [Comment] column.
Is there any update on this, or is it just a temporary error?
I am following what is mentioned in the page below to launch the Spark history server locally and run the Spark UI, but I am getting an error when starting the container.
> https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html
Did anyone face the same issue? Please help.
```
2023-02-22 17:54:07 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
2023-02-22 17:54:07 23/02/22 22:54:07 INFO HistoryServer: Started daemon with process name: 1@514d84090bb7
2023-02-22 17:54:07 23/02/22 22:54:07 INFO SignalUtils: Registering signal handler for TERM
2023-02-22 17:54:07 23/02/22 22:54:07 INFO SignalUtils: Registering signal handler for HUP
2023-02-22 17:54:07 23/02/22 22:54:07 INFO SignalUtils: Registering signal handler for INT
2023-02-22 17:54:07 23/02/22 22:54:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: Changing view acls to: root
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: Changing modify acls to: root
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: Changing view acls groups to:
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: Changing modify acls groups to:
2023-02-22 17:54:08 23/02/22 22:54:08 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2023-02-22 17:54:08 23/02/22 22:54:08 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions:
2023-02-22 17:54:08 Exception in thread "main" java.lang.reflect.InvocationTargetException
2023-02-22 17:54:08 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2023-02-22 17:54:08 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2023-02-22 17:54:08 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2023-02-22 17:54:08 at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2023-02-22 17:54:08 at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:300)
2023-02-22 17:54:08 at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
2023-02-22 17:54:08 Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
2023-02-22 17:54:08 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
2023-02-22 17:54:08 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
2023-02-22 17:54:08 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
2023-02-22 17:54:08 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
2023-02-22 17:54:08 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
2023-02-22 17:54:08 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
2023-02-22 17:54:08 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
2023-02-22 17:54:08 at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:116)
2023-02-22 17:54:08 at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:88)
2023-02-22 17:54:08 ... 6 more
```
I want to create an EventBridge event to trigger a Glue job, but when I create a Glue trigger there is no option for EventBridge (on the legacy page it appears but is blocked). I have CloudTrail enabled. Where is the problem? Is this option still available?
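For reference, the documented route for this is a Glue event-driven workflow: the EventBridge rule targets a Glue workflow, and an EVENT trigger inside that workflow starts the job. A hedged boto3 sketch, with all names as placeholders and the EventBridge side only outlined in comments:
```
import boto3

glue = boto3.client("glue")

# Event-driven Glue: the job is started by an EVENT trigger inside a workflow,
# and the EventBridge rule targets the workflow (not the job directly).
glue.create_workflow(Name="my-event-workflow")

glue.create_trigger(
    Name="my-event-trigger",
    WorkflowName="my-event-workflow",
    Type="EVENT",                        # event-driven trigger type
    Actions=[{"JobName": "my-glue-job"}],
    # Optional: start after N matching events or after a time window (seconds).
    EventBatchingCondition={"BatchSize": 1, "BatchWindow": 900},
)

# The EventBridge rule's target should then be the workflow ARN
# (arn:aws:glue:<region>:<account>:workflow/my-event-workflow) with a role
# that allows glue:NotifyEvent.
```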
I am trying to delete entries from my Lake Formation governed table. I ran the commands via the SDK and everything looked successful, but the linked Athena table still sees the data that was supposedly deleted. Deleting the S3 objects afterwards (since DeleteObject on the governed table doesn't touch S3) now throws errors in Athena because the expected files are missing.
Is there something wrong with my process of deleting from Lake Formation governed tables?