Questions tagged with AWS Glue
Hey Guys!
I am trying to read a large amount of data (about 45 GB in 5,500,000 files) from S3 and rewrite it into a partitioned folder (another folder inside the same bucket), but I am facing this error:
Exception in User Class: com.amazonaws.SdkClientException : Unable to execute HTTP request: readHandshakeRecord
When I try with just one file in the same folder, it works.
Do you have any idea what the problem could be?
Code (running with 60 DPUs, Glue 4.0):
```
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext
object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "s3",
      format = "parquet",
      options = JsonOptions("""{"paths": ["s3://bucket/raw-folder"], "recurse": true, "groupFiles": "inPartition", "useS3ListImplementation": true}""")
    ).getDynamicFrame()

    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://bucket/partition-folder"}"""),
      format = "parquet",
      formatOptions = JsonOptions("""{"compression": "snappy", "blockSize": 268435456, "pageSize": 1048576, "useGlueParquetWriter": true}""")
    ).writeDynamicFrame(dynamicFrame.repartition(10))
  }
}
```
Best
I ran a crawler to load all my S3 CSV files into the Glue Data Catalog. Now I want to create a Glue job to run the ETL (create and drop temporary tables, select and insert data into tables in the Data Catalog). But in the Glue job (a Python shell job) I have to split my SQL statements and execute them one by one. With the following code, I get an error.
```
client = boto3.client('athena')
client.start_query_execution(
    QueryString="""drop table if exists database1.temptable1 ;CREATE EXTERNAL TABLE IF NOT EXISTS temptable1(id int ) """,
    ResultConfiguration=config
)
```
Is there any way to run multiple SQL statements in a Glue job?
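One approach, as a minimal sketch (not a verified solution; the results bucket, database, and LOCATION below are placeholders): since Athena's `start_query_execution` accepts a single statement per call, split the script on `;` and submit the statements one at a time, waiting for each to finish with `get_query_execution`:

```python
import time
import boto3

athena = boto3.client('athena')
config = {'OutputLocation': 's3://my-athena-results-bucket/'}  # placeholder

sql_script = """
drop table if exists database1.temptable1;
CREATE EXTERNAL TABLE IF NOT EXISTS database1.temptable1 (id int)
LOCATION 's3://my-data-bucket/temptable1/'
"""

# Naive split on ';': assumes no semicolons inside string literals.
for statement in [s.strip() for s in sql_script.split(';') if s.strip()]:
    qid = athena.start_query_execution(
        QueryString=statement,
        ResultConfiguration=config
    )['QueryExecutionId']

    # Block until this statement finishes before submitting the next one.
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid
        )['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(2)

    if state != 'SUCCEEDED':
        raise RuntimeError(f'Statement failed ({state}): {statement[:80]}')
```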
Data layer is not my thing and I need some guidance.
I created a Glue crawler to catalog compressed JSON files stored in an AWS S3 bucket. I recently learned that I can use Athena to connect directly to the Glue database. When I run `select * from *table-name*` it starts to load, but then it errors with a long string of stuff:
```
HIVE_METASTORE_ERROR: Error: : expected at the position 407 of 'struct<http:struct<status_code:int,url_details:struct<path:string,queryString:struct<prefix:string,versioning:string,logging:string,encoding-type:string,nodes:string,policy:string,acl:string,policyStatus:string,replication:string,notification:string,tagging:string,website:string,encryption:string,size:string,limit:string,hash:string,accelerate:string,publicAccessBlock:string,code:string,protocol:string,G%EF%BF%BD%EF%BF%BD%EF%BF%BD\,%EF%BF%BD%EF%BF%BD%EF%BF%BD`~%EF%BF%BD%00%EF%BF%BD%EF%BF%BD{%EF%BF%BD%D5%96%EF%BF%BDw%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3C:string,cors:string,object- etc etc etc.
```
I can load one small table but the others fail.
I can't find a proper way of setting the correct data type for a timestamp attribute on my Athena table (**parquet**) in order to query time intervals.
I'm creating the table via a crawler on parquet files produced by a Glue interactive job (PySpark).
**Below are my Athena table and the timestamp formats I'm trying:**
1. ts (datatype **timestamp**) : 2023-02-07 23:59:59.460000
2. ts_iso_stamp (datatype **timestamp**) : 2023-02-07 23:59:59.460000
3. ts_iso_str (datatype **timestamp**) : "2023-02-07T23:59:59.460000+00:00"
**I'm applying this mapping in my Glue job to create the various formats:**
```
ApplyMapping.apply(frame = dyf, mappings = [
    ("date", "date", "date", "date"),
    ("ts", "string", "ts", "timestamp"),
    ("ts_iso", "string", "ts_iso_stamp", "timestamp"),
    ("ts_iso", "string", "ts_iso_str", "string")
])
```
**ts_iso looks like this when you run .show() on the DynamicFrame**, a valid ISO format as far as I understand:
```
"ts_iso": "2023-02-07T23:59:59.460000+00:00"
```

Query examples, failing with the same error:
```
select * from searchdb.queries_lu where appid = 'aHs7sPcX66ytv2FqmRnv'
AND ts >= TIMESTAMP '2023-02-07 23:59:59.405000'
AND ts <= TIMESTAMP '2023-02-08 00:00:00.637000';
select * from searchdb.queries_lu where appid = 'aHs7sPcX66ytv2FqmRnv'
AND ts_iso_stamp >= TIMESTAMP '2023-02-07 23:59:59.405000'
AND ts_iso_stamp <= TIMESTAMP '2023-02-08 00:00:00.637000';
select * from searchdb.queries_lu where appid = 'aHs7sPcX66ytv2FqmRnv'
AND ts_iso_str >= TIMESTAMP '2023-02-07T23:59:59.405000+00:00'
AND ts_iso_str <= TIMESTAMP '2023-02-08T00:00:00.637000+00:00';
```
Athena DDL:
```
create table searchdb.queries_ac
(
    ts            timestamp(23, 3),
    ts_iso_stamp  timestamp(23, 3),
    ts_iso_str    string(255),
    endpoint      string(255),
    appid         string(255),
    requestid     string(255),
    "in"          string(255),
    q             string(255),
    qq            string(255),
    results_id    string(255),
    results_count string(255),
    requestline   string(255),
    "date"        string(255)
);
```
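For what it's worth, one conversion I would sketch (assuming the source `ts_iso` column is the ISO-8601 string shown above; column names and the exact format pattern may need adjusting): parse the string into a real Spark timestamp on the underlying DataFrame before writing the parquet, so the crawler picks up a native timestamp column rather than relying on ApplyMapping to handle the `+00:00` offset.

```python
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F

# Work on the underlying DataFrame: Spark can usually cast an ISO-8601
# string with an offset ("2023-02-07T23:59:59.460000+00:00") to timestamp.
df = dyf.toDF()
df = df.withColumn("ts_iso_stamp", F.col("ts_iso").cast("timestamp"))

# If the plain cast does not handle the offset, an explicit pattern can be tried:
# df = df.withColumn("ts_iso_stamp",
#                    F.to_timestamp("ts_iso", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX"))

dyf_out = DynamicFrame.fromDF(df, glueContext, "dyf_out")
```

With a genuine timestamp column in the parquet, the first two queries above (plain `TIMESTAMP '2023-02-07 23:59:59.405000'` literals) are the form Athena expects; the third, comparing a string column to a `TIMESTAMP` literal, is the one I'd expect Athena to reject regardless.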
My requirement: I want to query an Iceberg table present in another AWS account.
Let's say I am a user of account A and want to query account B's Iceberg tables in that account's Glue catalog. I followed the steps from the AWS docs [glue cross account](https://docs.aws.amazon.com/athena/latest/ug/security-iam-cross-account-glue-catalog-access.html) and [s3 cross account](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html)
to attach permissions to account B's Glue catalog and the S3 bucket where the data is stored. The policies contain the required permissions with principal `account A:root`, meaning any user from account A should be able to query. I also attached the Glue and S3 policies to account A's user.
Then I went to Athena and created a Glue data source with the catalog ID set to account B's account ID, and I am able to see all of account B's Glue databases and tables.
But when I query the table, for example `select * from table`, it gives the error `HIVE_METASTORE_ERROR: Table storage descriptor is missing SerDe info`.
However, I am able to query the table properly from within account B, even though its SerDeInfo is empty there as well.
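A quick way to compare what the two catalogs actually return (a sketch using the standard boto3 Glue API; the account ID, database, and table names are placeholders) is to dump the table's storage descriptor from each side:

```python
import json
import boto3

glue = boto3.client('glue')

# CatalogId points the call at account B's catalog; omit it for your own account.
resp = glue.get_table(
    CatalogId='111122223333',      # account B's account ID (placeholder)
    DatabaseName='my_database',
    Name='my_iceberg_table'
)

table = resp['Table']
print(json.dumps(table['StorageDescriptor'].get('SerdeInfo', {}), indent=2, default=str))
print(table.get('Parameters', {}))  # Iceberg tables usually mark themselves here (table_type)
```

Running the same call without `CatalogId` from inside account B makes it easy to diff the two responses.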
The [Glue JDBC Connection documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc) states:
> If you already have a JDBC connection defined, you can reuse the configuration properties defined in it, such as: url, user and password; so you don't have to specify them in the code as connection options. To do so, use the following connection properties:
>* "useConnectionProperties": Set it to "true" to indicate you want to use the configuration from a connection.
>* "connectionName": Enter the connection name to retrieve the configuration from, the connection must be defined in the same region as the job.
There is no further documentation on how to use these properties. I have tried setting these properties as kwargs to `glueContext.create_dynamic_frame.from_options()`, but the method continues to throw an error if `url` is not specified in `connection_options`:
```
dbtable = 'schema.table'
query = f"select top 1 * from {dbtable}"
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    useConnectionProperties="true",
    connectionName="My-Glue-Connection",
    connection_options={
        "dbtable": dbtable,
        "sampleQuery": query
    })
Py4JJavaError: An error occurred while calling o212.getDynamicFrame.
: java.util.NoSuchElementException: key not found: url
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at com.amazonaws.services.glue.util.JDBCWrapper$.apply(JDBCUtils.scala:913)
at com.amazonaws.services.glue.util.JDBCWrapper$.apply(JDBCUtils.scala:909)
at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:943)
at com.amazonaws.services.glue.DataSource$class.getDynamicFrame(DataSource.scala:97)
at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:709)
...
```
The same error occurs if I pass the two properties via `connection_options`:
```
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "IDAP-Glue-Connection",
        "dbtable": dbtable,
        "sampleQuery": query
    })
Py4JJavaError: An error occurred while calling o123.getDynamicFrame.
: java.util.NoSuchElementException: key not found: url
...
```
How is this feature intended to be used? The only methods I've found for using a Glue Connection to read from a JDBC database are really unwieldy; I would expect tighter integration of Glue Connections in Glue jobs.
Sample code (minus boilerplate):
```
conn = glueContext.extract_jdbc_conf(connection_name="My-Glue-Connection")
for i in conn.items():
    print(i)
```
Output from notebook:
```
('enforceSSL', 'false')
('skipCustomJDBCCertValidation', 'false')
('url', 'jdbc:sqlserver://0.0.0.0:1433')
('customJDBCCertString', '')
('user', 'test')
('customJDBCCert', '')
('password', 'xxx')
('vendor', 'sqlserver')
```
Output from running job:
```
**('fullUrl', 'jdbc:sqlserver://0.0.0.0:1433/mydb')**
('enforceSSL', 'false')
('skipCustomJDBCCertValidation', 'false')
('url', 'jdbc:sqlserver://0.0.0.0:1433')
('customJDBCCertString', '')
('user', 'test')
('customJDBCCert', '')
('password', 'xxx')
('vendor', 'sqlserver')
```
`fullUrl` (and thus the name of the database to connect to) is not available when using this method in a notebook.
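For reference, the unwieldy workaround I mean looks roughly like this (a sketch only; the database name, connection name, and the SQL Server URL suffix are assumptions to adjust): pull the connection properties with `extract_jdbc_conf` and pass `url`, `user`, and `password` explicitly, adding the database that `fullUrl` would otherwise carry:

```python
# Build the JDBC options by hand from the Glue Connection.
conn = glueContext.extract_jdbc_conf(connection_name="My-Glue-Connection")

dbtable = "schema.table"
query = f"select top 1 * from {dbtable}"

# 'fullUrl' only shows up in a job run; in a notebook fall back to 'url'
# plus an explicit database name (suffix format depends on the JDBC driver).
jdbc_url = conn.get("fullUrl", conn["url"] + ";databaseName=mydb")

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        "url": jdbc_url,
        "user": conn["user"],
        "password": conn["password"],
        "dbtable": dbtable,
        "sampleQuery": query,
    },
)
```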
AWS provides AWS Glue Data Quality, powered by DQDL. Is there a DQDL example for time-series sensor data? SageMaker also offers a Data Quality and Insights Report under Data Wrangler, but that is not a great fit either, IMHO. Is there a tool in AWS that can produce a custom data quality report for time-series data? I'm asking because the time-series data is simple, usually having the following fields: [Device, Timestamp, Sensorname, Value], where the sensors can be unevenly spaced time series, etc. Too many options, but nothing complete.
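To make the question concrete, here is the kind of ruleset I have been sketching for a [Device, Timestamp, Sensorname, Value] table, written as a DQDL string for a Glue Data Quality evaluation (the thresholds are placeholders, and checks like even spacing per sensor are exactly where DQDL seems to run out):

```python
# Hypothetical DQDL ruleset for a simple sensor table; thresholds are made up.
timeseries_ruleset = """
Rules = [
    IsComplete "Device",
    IsComplete "Timestamp",
    IsComplete "Sensorname",
    IsComplete "Value",
    ColumnValues "Value" between -1000.0 and 1000.0,
    DataFreshness "Timestamp" <= 24 hours
]
"""
# The string can then be supplied as the ruleset of a Glue Data Quality
# evaluation (e.g. the EvaluateDataQuality transform in a Glue job, or a
# ruleset attached to the catalog table).
```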
I'm running many Glue jobs. How can we set a limit or an alarm if a job runs for more than 6 hours or 10 DPU-hours? Are there any CloudWatch metrics available on which an alarm can be set for cost monitoring purposes?
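What I have sketched so far for the runtime side (not verified end to end; the job name, SNS topic, and especially the metric dimensions should be double-checked in the CloudWatch console): with job metrics enabled, Glue publishes per-job metrics in the `Glue` namespace, such as `glue.driver.aggregate.elapsedTime` in milliseconds, and a standard `put_metric_alarm` can flag runs that exceed 6 hours:

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when a job's aggregated elapsed time exceeds 6 hours (metric is in ms).
cloudwatch.put_metric_alarm(
    AlarmName='glue-my-job-runtime-over-6h',
    Namespace='Glue',
    MetricName='glue.driver.aggregate.elapsedTime',
    Dimensions=[
        {'Name': 'JobName', 'Value': 'my-glue-job'},  # placeholder
        {'Name': 'JobRunId', 'Value': 'ALL'},
        {'Name': 'Type', 'Value': 'count'},           # verify in the console
    ],
    Statistic='Maximum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=6 * 60 * 60 * 1000,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:glue-cost-alerts'],  # placeholder
)
```

For a hard stop rather than an alarm, the job's Timeout property (in minutes) will terminate runs that exceed it; the DPU-hour side still seems to need a cost/billing alarm rather than a Glue metric.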
I am transforming my table by adding new columns using the SQL Query transform in an AWS Glue Studio job.

SQL aliases: study
Existing schema from the Data Catalog: study id, patient id, patient age
I want to transform the existing schema by adding new columns.
New columns: AccessionNo
Transformed schema: study id, patient id, patient age, AccessionNo
SQL query: **alter table study add columns (AccessionNo int)**
Error it gives:
```
pyspark.sql.utils.AnalysisException: Invalid command: 'study' is a view not a table.; line 2 pos 0;
'AlterTable V2SessionCatalog(spark_catalog), default.study, 'UnresolvedV2Relation [study], V2SessionCatalog(spark_catalog), default.study, [org.apache.spark.sql.connector.catalog.TableChange$AddColumn@1e7cbfec]
```
I looked at the AWS official doc for the SQL transform, and it says queries should be in Spark SQL syntax; my query is in Spark SQL syntax.
https://docs.aws.amazon.com/glue/latest/ug/transforms-sql.html
What is the exact issue, and how can I resolve it?
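One workaround I am considering, based on the error text itself saying `study` is registered as a view (so DDL like ALTER TABLE presumably doesn't apply there): project the extra column in a SELECT instead. A sketch of the equivalent query, shown via `spark.sql` against the same `study` alias, with the new column defaulted to NULL and cast to int:

```python
# Instead of ALTER TABLE, add the column in the query itself.
# In the Glue Studio SQL transform this SELECT would be the query text;
# here it is run through spark.sql against the registered "study" view.
sql_df = spark.sql("""
    SELECT
        *,
        CAST(NULL AS INT) AS AccessionNo
    FROM study
""")
sql_df.printSchema()
```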
Thanks
Not able to display a parquet file in Athena; the JSON was converted into parquet using a Lambda function.
Hello, I'm creating a Glue job using a Jupyter notebook, and I'm currently using Ray as the ETL type.
After running the job once, I noticed I can no longer save my notebook or push it to a repository, because
the Glue version randomly downgraded to 3.0 in the Job details and I have no way to change it back to 4.0.
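What I plan to try next (a guess, assuming the notebook is backed by Glue interactive sessions, whose cell magics include `%glue_version`): pin the version explicitly in the first cell before the session starts, so the job details don't fall back to another default.

```python
# First notebook cell, before any code that starts the session.
# These are Glue interactive-session magics, not Python statements.
%glue_version 4.0
%idle_timeout 60
```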