Questions tagged with AWS Glue


Hi all! I am trying to read a large amount of data (about 45 GB across 5,500,000 files) from S3 and rewrite it into a partitioned folder (another folder inside the same bucket), but I am getting this error: `Exception in User Class: com.amazonaws.SdkClientException: Unable to execute HTTP request: readHandshakeRecord`. When I try with just one file in the same folder, it works. Do you have any idea what the problem could be? Code (running with 60 DPUs, Glue 4.0):

```scala
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "s3",
      format = "parquet",
      options = JsonOptions("""{"paths": ["s3://bucket/raw-folder"], "recurse": true, "groupFiles": "inPartition", "useS3ListImplementation": true}""")
    ).getDynamicFrame()

    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://bucket/partition-folder"}"""),
      format = "parquet",
      formatOptions = JsonOptions("""{"compression": "snappy", "blockSize": 268435456, "pageSize": 1048576, "useGlueParquetWriter": true}""")
    ).writeDynamicFrame(dynamicFrame.repartition(10))
  }
}
```

Best
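With millions of small input files, the usual suspects are S3 request throttling and listing pressure on the driver; pairing `groupFiles` with an explicit `groupSize` lets Glue coalesce many small files into larger read tasks. Below is a minimal PySpark sketch of that idea, not a confirmed fix — the paths and the 128 MB group size are placeholders, not values from the original post:

```python
# Hypothetical sketch: group many small S3 files into ~128 MB read groups.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": ["s3://bucket/raw-folder"],          # placeholder path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",                     # ~128 MB per read group (assumption)
        "useS3ListImplementation": True,              # stream the S3 listing instead of buffering it
    },
)

glue_context.write_dynamic_frame.from_options(
    frame=dyf.repartition(10),
    connection_type="s3",
    connection_options={"path": "s3://bucket/partition-folder"},  # placeholder path
    format="parquet",
    format_options={"compression": "snappy", "useGlueParquetWriter": True},
)
```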
1
answers
0
votes
52
views
lp_evan
asked a month ago
I ran a crawler to load all my S3 CSV files into the Glue Data Catalog. Now I want to create a Glue job to execute ETL (create and drop temporary tables, select and insert data into tables in the Data Catalog). But in the Glue job (a Python shell job) I have to split my SQL statements and execute them one by one. With the following code, I get an error:

```python
client = boto3.client('athena')
client.start_query_execution(
    QueryString="""drop table if exists database1.temptable1;
                   CREATE EXTERNAL TABLE IF NOT EXISTS temptable1(id int)""",
    ResultConfiguration=config
)
```

Is there any way to run multiple SQL statements in a Glue job?
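Athena's `StartQueryExecution` accepts a single statement per call, so one common pattern is to split the script into individual statements and submit them in sequence, polling each one until it finishes before moving on. A minimal boto3 sketch of that pattern — the statements, database, and output location are placeholders:

```python
import time
import boto3

athena = boto3.client("athena")
config = {"OutputLocation": "s3://my-athena-results/"}  # placeholder results bucket

statements = [
    "DROP TABLE IF EXISTS database1.temptable1",
    "CREATE EXTERNAL TABLE IF NOT EXISTS database1.temptable1 (id int) "
    "LOCATION 's3://my-bucket/temptable1/'",            # placeholder location
]

for sql in statements:
    qid = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration=config,
    )["QueryExecutionId"]

    # Wait for this statement to finish before submitting the next one.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Statement failed ({state}): {sql}")
```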
0
answers
0
votes
27
views
asked a month ago
The data layer is not my thing and I need some guidance. I created a Glue crawler to extract compressed JSON files stored in an AWS S3 bucket. I recently learned that I can use Athena to connect directly to the Glue database. When I run `select * from *table-name*` it starts to load but then errors with a long string of output:

HIVE_METASTORE_ERROR: Error: : expected at the position 407 of 'struct<http:struct<status_code:int,url_details:struct<path:string,queryString:struct<prefix:string,versioning:string,logging:string,encoding-type:string,nodes:string,policy:string,acl:string,policyStatus:string,replication:string,notification:string,tagging:string,website:string,encryption:string,size:string,limit:string,hash:string,accelerate:string,publicAccessBlock:string,code:string,protocol:string,G%EF%BF%BD%EF%BF%BD%EF%BF%BD\,%EF%BF%BD%EF%BF%BD%EF%BF%BD`~%EF%BF%BD%00%EF%BF%BD%EF%BF%BD{%EF%BF%BD%D5%96%EF%BF%BDw%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3C:string,cors:string,object- etc etc etc.

I can load one small table but the others fail.
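This error usually means the crawler inferred a struct member whose name contains characters the metastore type parser cannot handle (here, the URL-encoded garbage inside `queryString`). One way to confirm which column is affected is to pull the table definition from the Data Catalog and look at the inferred type strings; a small boto3 sketch, with the database and table names as placeholders:

```python
import boto3

glue = boto3.client("glue")

# Placeholder names: replace with the crawled database and table.
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]

for column in table["StorageDescriptor"]["Columns"]:
    # Print each top-level column and the start of its inferred type; the offending
    # struct member shows up inside the type string near the position Athena reports.
    print(column["Name"], "->", column["Type"][:300])
```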
0
answers
0
votes
35
views
asked a month ago
I can't find the proper way of setting the correct data type for a timestamp attribute on my Athena table (**parquet**) in order to query for time intervals. I'm creating the table via a crawler on Parquet files produced by a Glue interactive job (PySpark).

**Below are my Athena table columns and the timestamp formats I'm trying:**

1. ts (data type **timestamp**): 2023-02-07 23:59:59.460000
2. ts_iso_stamp (data type **timestamp**): 2023-02-07 23:59:59.460000
3. ts_iso_str (data type **timestamp**): "2023-02-07T23:59:59.460000+00:00"

**I'm applying this mapping in my Glue job to create the various formats:**

```
ApplyMapping.apply(frame = dyf, mappings = [
    ("date", "date", "date", "date"),
    ("ts", "string", "ts", "timestamp"),
    ("ts_iso", "string", "ts_iso_stamp", "timestamp"),
    ("ts_iso", "string", "ts_iso_str", "string")
])
```

**ts_iso looks like this when you run .show() on the DynamicFrame** — a valid ISO format as far as I understand:

```
"ts_iso": "2023-02-07T23:59:59.460000+00:00"
```

![Enter image description here](/media/postImages/original/IMwuWLVJkESiWoniIlgFhD8A)

Query examples failing with the same error:

```
select * from searchdb.queries_lu
where appid = 'aHs7sPcX66ytv2FqmRnv'
AND ts >= TIMESTAMP '2023-02-07 23:59:59.405000'
AND ts <= TIMESTAMP '2023-02-08 00:00:00.637000';

select * from searchdb.queries_lu
where appid = 'aHs7sPcX66ytv2FqmRnv'
AND ts_iso_stamp >= TIMESTAMP '2023-02-07 23:59:59.405000'
AND ts_iso_stamp <= TIMESTAMP '2023-02-08 00:00:00.637000';

select * from searchdb.queries_lu
where appid = 'aHs7sPcX66ytv2FqmRnv'
AND ts_iso_str >= TIMESTAMP '2023-02-07T23:59:59.405000+00:00'
AND ts_iso_str <= TIMESTAMP '2023-02-08T00:00:00.637000+00:00';
```

Athena DDL:

```
create table searchdb.queries_ac (
    ts timestamp(23, 3),
    ts_iso_stamp timestamp(23,3),
    ts_iso_str string(255),
    endpoint string(255),
    appid string(255),
    requestid string(255),
    "in" string(255),
    q string(255),
    qq string(255),
    results_id string(255),
    results_count string(255),
    requestline string(255),
    "date" string(255)
);
```
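ApplyMapping's string-to-timestamp cast can behave unexpectedly when the source string carries the ISO 'T' separator and a UTC offset. One alternative is to parse the string explicitly with Spark SQL functions before writing the Parquet, so the column lands as a real timestamp that Athena can range-filter. A hedged PySpark sketch — `dyf` and `glueContext` are assumed to come from the job in the post, and the format pattern is an assumption based on the value shown there:

```python
# Sketch: parse an ISO-8601 string with offset into a proper Spark timestamp
# before converting back to a DynamicFrame and writing Parquet.
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F

df = dyf.toDF()  # dyf is the DynamicFrame from the post

# to_timestamp handles the 'T' separator and microseconds; the offset pattern
# ("XXX") is an assumption about the input format shown in the question.
df = df.withColumn(
    "ts_parsed",
    F.to_timestamp("ts_iso", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX"),
)

dyf_out = DynamicFrame.fromDF(df, glueContext, "dyf_out")
```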
1
answers
0
votes
27
views
asked a month ago
My requirement is this: I want to query an Iceberg table in another AWS account. Say I am a user of account A and want to query account B's Iceberg tables, which are registered in that account's Glue catalog. I followed the steps from the AWS docs [glue cross account](https://docs.aws.amazon.com/athena/latest/ug/security-iam-cross-account-glue-catalog-access.html) and [s3 cross account](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html) to attach permissions to account B's Glue catalog and the S3 bucket where the data is stored; the policy contains the required permissions with principal `account A:root`, meaning any user from account A should be able to query. I also attached the Glue and S3 policies to account A's user. Then I went to Athena and created the data source as Glue with the catalog ID set to account B's account ID, and I can see all of account B's Glue databases and tables. But when I query a table, e.g. `select * from table`, it gives the error `HIVE_METASTORE_ERROR: Table storage descriptor is missing SerDe info`. I am able to query the table without issue in account B, although its SerDeInfo is indeed empty there as well.
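Since the table works in account B but Athena in account A complains about missing SerDe info, it can help to compare what each account actually sees in the catalog. A small boto3 sketch that fetches the table definition through the cross-account catalog ID and prints the SerDe section — the account ID, database, and table names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Placeholders: account B's ID and the shared database/table names.
table = glue.get_table(
    CatalogId="111122223333",
    DatabaseName="shared_db",
    Name="iceberg_table",
)["Table"]

sd = table.get("StorageDescriptor", {})
print("SerdeInfo:", sd.get("SerdeInfo"))
# Iceberg tables are usually flagged via the table_type parameter rather than SerDe info.
print("table_type:", table.get("Parameters", {}).get("table_type"))
```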
0
answers
0
votes
40
views
asked a month ago
The [Glue JDBC Connection documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc) states:

> If you already have a JDBC connection defined, you can reuse the configuration properties defined in it, such as: url, user and password; so you don't have to specify them in the code as connection options. To do so, use the following connection properties:
> * "useConnectionProperties": Set it to "true" to indicate you want to use the configuration from a connection.
> * "connectionName": Enter the connection name to retrieve the configuration from, the connection must be defined in the same region as the job.

There is no further documentation of how to use these properties. I have tried passing these properties as kwargs to `glueContext.create_dynamic_frame.from_options()`, but the method continues to throw an error if `url` is not specified in `connection_options`:

```
dbtable = 'schema.table'
query = f"select top 1 * from {dbtable}"
dyf = glueContext.create_dynamic_frame.from_options(connection_type="sqlserver",
                                                    useConnectionProperties="true",
                                                    connectionName="My-Glue-Connection",
                                                    connection_options={
                                                        "dbtable": dbtable,
                                                        "sampleQuery": query
                                                    })

Py4JJavaError: An error occurred while calling o212.getDynamicFrame.
: java.util.NoSuchElementException: key not found: url
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at scala.collection.AbstractMap.default(Map.scala:59)
	at scala.collection.MapLike$class.apply(MapLike.scala:141)
	at scala.collection.AbstractMap.apply(Map.scala:59)
	at com.amazonaws.services.glue.util.JDBCWrapper$.apply(JDBCUtils.scala:913)
	at com.amazonaws.services.glue.util.JDBCWrapper$.apply(JDBCUtils.scala:909)
	at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:943)
	at com.amazonaws.services.glue.DataSource$class.getDynamicFrame(DataSource.scala:97)
	at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:709)
	...
```

The same error occurs if I pass the two properties via `connection_options`:

```
dyf = glueContext.create_dynamic_frame.from_options(connection_type="sqlserver",
                                                    connection_options={
                                                        "useConnectionProperties": "true",
                                                        "connectionName": "IDAP-Glue-Connection",
                                                        "dbtable": dbtable,
                                                        "sampleQuery": query
                                                    })

Py4JJavaError: An error occurred while calling o123.getDynamicFrame.
: java.util.NoSuchElementException: key not found: url
...
```

How is this feature intended to be used? The only methods I've found to read from a JDBC database using a Glue Connection are really unwieldy; I would expect tighter integration of Glue Connections in Glue jobs.
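One workaround, assuming the documented `useConnectionProperties` path keeps failing, is to resolve the connection yourself with `extract_jdbc_conf` and pass the resulting url/user/password into `connection_options` explicitly. A sketch under those assumptions — the connection and table names reuse the question's placeholders, and the `;databaseName=` suffix is an assumption about the SQL Server URL format:

```python
# Sketch: resolve a Glue Connection manually and feed its settings to from_options.
conn = glueContext.extract_jdbc_conf(connection_name="My-Glue-Connection")

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        # fullUrl (when present) already includes the database; otherwise append it.
        "url": conn.get("fullUrl", conn["url"] + ";databaseName=mydb"),  # "mydb" is a placeholder
        "user": conn["user"],
        "password": conn["password"],
        "dbtable": "schema.table",
    },
)
```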
1
answers
0
votes
26
views
asked a month ago
Sample code (minus boilerplate):

```
conn = glueContext.extract_jdbc_conf(connection_name="My-Glue-Connection")
for i in conn.items():
    print(i)
```

Output from notebook:

```
('enforceSSL', 'false')
('skipCustomJDBCCertValidation', 'false')
('url', 'jdbc:sqlserver://0.0.0.0:1433')
('customJDBCCertString', '')
('user', 'test')
('customJDBCCert', '')
('password', 'xxx')
('vendor', 'sqlserver')
```

Output from running job:

```
('fullUrl', 'jdbc:sqlserver://0.0.0.0:1433/mydb')
('enforceSSL', 'false')
('skipCustomJDBCCertValidation', 'false')
('url', 'jdbc:sqlserver://0.0.0.0:1433')
('customJDBCCertString', '')
('user', 'test')
('customJDBCCert', '')
('password', 'xxx')
('vendor', 'sqlserver')
```

`fullUrl` (and thus the name of the database to connect to) is only present in the job run; it is not available when calling this method from a notebook.
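A defensive pattern, assuming the notebook environment keeps omitting `fullUrl`, is to fall back to composing the URL from `url` plus the database name the job already knows. A short sketch — the `mydb` name is a placeholder, and the `;databaseName=` form is the usual SQL Server JDBC syntax (the job's `fullUrl` shows a slash form; either may work depending on the driver):

```python
# Sketch: prefer fullUrl when the runtime provides it, otherwise build the URL.
conn = glueContext.extract_jdbc_conf(connection_name="My-Glue-Connection")

database = "mydb"  # placeholder: the database the job should read from
jdbc_url = conn.get("fullUrl") or f"{conn['url']};databaseName={database}"
print(jdbc_url)
```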
2
answers
0
votes
16
views
asked a month ago
AWS provides AWS Glue Data Quality, powered by DQDL. Is there a DQDL example for time-series sensor data? AWS also offers the "Data Quality and Insights Report" in SageMaker under Data Wrangler, but that is not a great fit either, IMHO. Is there a tool in AWS that can provide a custom Data Quality report for time-series data? I ask because the time-series data is simple, usually having the fields [Device, Timestamp, Sensorname, Value], where the sensors can be unevenly spaced time series, etc. Too many options, but nothing complete.
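There isn't a packaged time-series report, but a pragmatic starting point is a DQDL ruleset covering the generic failure modes of a [Device, Timestamp, Sensorname, Value] table (missing timestamps, null devices, out-of-range values) registered against the catalog table. A hedged boto3 sketch — the rules, value range, table names, and IAM role are illustrative assumptions, and gap/evenness checks would still need custom logic:

```python
import boto3

glue = boto3.client("glue")

# Illustrative DQDL ruleset for a [Device, Timestamp, Sensorname, Value] table.
ruleset = """Rules = [
    IsComplete "Device",
    IsComplete "Timestamp",
    IsComplete "Sensorname",
    ColumnValues "Value" between -1000 and 1000
]"""

glue.create_data_quality_ruleset(
    Name="sensor-timeseries-basics",                   # placeholder name
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sensors_db",         # placeholder catalog table
                 "TableName": "readings"},
)

glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "sensors_db", "TableName": "readings"}},
    Role="arn:aws:iam::111122223333:role/GlueDQRole",  # placeholder role ARN
    RulesetNames=["sensor-timeseries-basics"],
)
```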
0
answers
0
votes
17
views
jinman
asked a month ago
I'm running many Glue jobs. How can we set a limit or an alarm if a job runs for more than 6 hours or consumes more than 10 DPU-hours? Are there any CloudWatch metrics available on which an alarm can be set for cost-monitoring purposes?
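Glue jobs publish per-job metrics into the `Glue` CloudWatch namespace when job metrics are enabled, and elapsed time is one of them, so a duration alarm can be built with `put_metric_alarm`; for a hard stop, the job-level `Timeout` setting is the simpler lever. A hedged sketch — the dimension values (especially `JobRunId=ALL` and `Type`) and the SNS topic are assumptions to verify against the metric as it appears in your CloudWatch console:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="glue-job-over-6h",                       # placeholder alarm name
    Namespace="Glue",
    MetricName="glue.driver.aggregate.elapsedTime",     # milliseconds; requires job metrics enabled
    Dimensions=[
        {"Name": "JobName", "Value": "my-glue-job"},    # placeholder job name
        {"Name": "JobRunId", "Value": "ALL"},           # assumption: aggregate across runs
        {"Name": "Type", "Value": "gauge"},             # assumption: confirm in the console
    ],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=6 * 60 * 60 * 1000,                       # 6 hours in milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:glue-alerts"],  # placeholder SNS topic
)
```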
1
answers
0
votes
34
views
asked a month ago
I am transforming my table by adding new columns using the SQL Query transform in AWS Glue Studio.

![visual diagram for transformation](/media/postImages/original/IMwcLXRM0iTROC0Uqb5lvOGg)

SQL alias: study

Existing schema from the Data Catalog: study id, patient id, patient age

I want to transform the existing schema by adding new columns.

New column: AccessionNo

Transformed schema: study id, patient id, patient age, AccessionNo

SQL query: **alter table study add columns (AccessionNo int)**

Error it gives:

pyspark.sql.utils.AnalysisException: Invalid command: 'study' is a view not a table.; line 2 pos 0; 'AlterTable V2SessionCatalog(spark_catalog), default.study, 'UnresolvedV2Relation [study], V2SessionCatalog(spark_catalog), default.study, [org.apache.spark.sql.connector.catalog.TableChange$AddColumn@1e7cbfec]

I looked at the official AWS doc for the SQL transform, which says queries should be in Spark SQL syntax, and my query is in Spark SQL syntax: https://docs.aws.amazon.com/glue/latest/ug/transforms-sql.html

What is the exact issue? Please help me resolve it. Thanks
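The SQL transform exposes its input node as a temporary view, so DDL statements like `ALTER TABLE` are rejected; new columns are added there by projecting them in a `SELECT`. A PySpark sketch of the equivalent query — `dyf` and `spark` are assumed from the job, and the `CAST(NULL AS INT)` is an assumption about how AccessionNo should be initialized:

```python
# Sketch: the Glue Studio SQL transform registers the input as a view
# (aliased "study" here), so new columns are added via a SELECT projection.
dyf.toDF().createOrReplaceTempView("study")

result = spark.sql("""
    SELECT *,
           CAST(NULL AS INT) AS AccessionNo   -- assumption: start the column empty
    FROM study
""")
result.show(5)
```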
1
answers
0
votes
56
views
Prabhu
asked a month ago
Hello, I'm creating a Glue job using a Jupyter notebook and I'm currently using Ray as the ETL type. After running the job once, I noticed I can no longer save my notebook or push it to a repository, because the Glue version was silently downgraded to 3.0 in the job details, and I have no way to switch it back to 4.0.
1
answers
0
votes
31
views
asked a month ago