Questions tagged with AWS Glue
I am using the IAM role AWSGlueServiceRole created in AWS Glue and tried to create a crawler to run on an S3 source. The error I get is:
The following crawler failed to create: "abc"
Here is the most recent error message: Account XXX is denied access.
I also tried with another role that I created with the policies below, but I still get the same error.
AmazonS3FullAccess
AWSGlueServiceRole
AdministratorAccess
AWSGlueConsoleFullAccess
AWSGlueSchemaRegistryFullAccess
AWSGlueDataBrewServiceRole
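For reference, this is roughly how I would expect the role to be set up so that Glue can assume it (a minimal boto3 sketch; the role name is a placeholder, not my actual role):
```
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the AWS Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# "MyGlueCrawlerRole" is a placeholder name.
iam.create_role(
    RoleName="MyGlueCrawlerRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the AWS-managed Glue service policy; S3 read access is added separately.
iam.attach_role_policy(
    RoleName="MyGlueCrawlerRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
```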

I am trying to use AWS Glue Studio to build a simple ETL workflow. Basically, I have a bunch of `csv` files in different directories in S3.
I want those CSVs to be accessible via a database and have chosen Redshift for the job. The directories will be updated every day with new csv files. The file structure is:
```
YYYY-MM-DD (e.g. 2023-03-07)
|---- groupName1
|     |---- groupName1.csv
|---- groupName2
|     |---- groupName2.csv
...
|---- groupNameN
|     |---- groupNameN.csv
```
We will be keeping historical data, so every day I will have a new date-based directory.
I've read that AWS Glue can automatically copy data on a schedule, but I can't see my Redshift databases or tables (screenshot below). I'm using my AWS admin account and I do have the `AWSGlueConsoleFullAccess` permission (screenshot below).
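To make the goal concrete, this is roughly the kind of job I'm hoping Glue Studio will generate for me (a hand-written sketch; the bucket, prefix, Redshift connection, and table names are placeholders, not my actual setup):
```
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read every CSV under one date-based prefix; bucket and prefix are placeholders.
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://my-bucket/2023-03-07/"], "recurse": True},
    format_options={"withHeader": True},
)

# Write to Redshift through a pre-defined Glue connection; names are placeholders.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "public.group_data", "database": "dev"},
    redshift_tmp_dir="s3://my-bucket/temp/",
)

job.commit()
```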


Hi All,
I have some issues when running my Glue job. I landed my pipe-delimited CSV file in an S3 bucket, and after running a crawler pointing to the folder where the file is placed, a Glue catalog table is created.
However, when I try to read the data (code below) from the catalog table in a Glue job for additional processing and conversion to Parquet, it is not picking up all the records.
```
dyf = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE,
    table_name=table_name,
    transformation_ctx="dyf-" + table_name,
)
rows = dyf.count()
print(f"DataFrame records count : {rows}")
```
Can someone please suggest what could be the reason for the missing records? I see that there are three columns in the catalog table with an incorrect data type (bigint in place of string). I manually corrected the data types and set `infer_schema = True` in the above code, but the job is still not picking up the correct number of records.
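Would reading the files directly from S3 with explicit CSV options, something like the sketch below, help isolate whether the catalog schema is the problem? (The path and options are placeholders, not my actual settings.)
```
# Read the same files straight from S3 with explicit pipe-delimiter options,
# to compare the raw count against the catalog-based count.
raw = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://my-bucket/landing/my-table/"], "recurse": True},
    format_options={"separator": "|", "withHeader": True, "quoteChar": '"'},
)
print(f"Raw S3 record count : {raw.count()}")
```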
I am getting an error when I run an AWS Glue job with a Data Quality check:
ModuleNotFoundError: No module named 'awsgluedq'
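For context, the job uses the Data Quality transform roughly like this (a simplified sketch, not my exact code; the ruleset and the DynamicFrame `dyf` are placeholders, and my understanding is that `awsgluedq` is only available inside Glue 3.0+ Spark job runtimes, not in Python shell jobs or plain local Python):
```
from awsgluedq.transforms import EvaluateDataQuality

# A trivial DQDL ruleset; the rule is a placeholder.
ruleset = "Rules = [ ColumnCount > 0 ]"

# Evaluate the ruleset against an existing DynamicFrame (dyf is assumed).
dq_results = EvaluateDataQuality.apply(
    frame=dyf,
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "my_dq_check"},
)
dq_results.toDF().show()
```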
Can anyone help?
Thanks,
We have a Jupyter Notebook Glue job, and we're calling start_job_run from a Lambda (Python with boto3). We would like to change the job's defaults for MaxConcurrentRuns (1) and MaxRetries (3). Since it is a notebook job we need to use magics; we have already tried magics, and we also tried passing Arguments from the Lambda that runs the job, but nothing seems to work. How should we set these so that the notebook job never retries and has a concurrency of 5, i.e. at most 5 runs of the job executing at the same time?
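For example, would something along these lines with boto3 be the right approach? (The job name is a placeholder; as far as I understand, UpdateJob overwrites the whole job definition, so the existing settings have to be carried over.)
```
import boto3

glue = boto3.client("glue")
job_name = "my-notebook-job"  # placeholder name

# UpdateJob replaces the entire definition, so start from the current one.
current = glue.get_job(JobName=job_name)["Job"]

glue.update_job(
    JobName=job_name,
    JobUpdate={
        "Role": current["Role"],
        "Command": current["Command"],
        "DefaultArguments": current.get("DefaultArguments", {}),
        "GlueVersion": current.get("GlueVersion", "4.0"),
        # Carry over any other settings you rely on (WorkerType, NumberOfWorkers, Connections, ...).
        "ExecutionProperty": {"MaxConcurrentRuns": 5},
        "MaxRetries": 0,
    },
)
```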
Available Amazon SageMaker kernels include [the following Spark kernels](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-kernels.html):
- PySpark (SparkMagic) with Python 3.7
- Spark (SparkMagic) with Python 3.7
- Spark Analytics 1.0
- Spark Analytics 2.0
And at re:Invent 2022 there was [an announcement](https://aws.amazon.com/about-aws/whats-new/2022/09/sagemaker-studio-supports-glue-interactive-sessions/) that "SageMaker Studio now supports Glue Interactive Sessions", letting you use "the built-in Glue PySpark or Glue Spark kernel for your Studio notebook to initialize interactive, serverless Spark sessions."
It seems like the benefits of using one of the Glue Spark kernels are that you can "quickly browse the Glue data catalog, run large queries, and interactively analyze and prepare data using Spark, right in your Studio notebook." But can't you already do all that with the existing two SageMaker kernels?
In other words, how do you choose whether to use one of the existing two SparkMagic kernels in SageMaker Studio notebooks or to use this new Glue Interactive Sessions feature?
Hey Guys!
I am trying to read a large amount of data (about 45 GB in 5,500,000 files) in S3 and rewrite it into a partitioned folder (another folder inside the same bucket), but I am facing this error:
Exception in User Class: com.amazonaws.SdkClientException : Unable to execute HTTP request: readHandshakeRecord
When I try with just one file in the same folder, it works.
Do you have any idea what could be the problem?
Code (running with 60 DPUs, Glue 4.0):
```
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "s3",
      format = "parquet",
      options = JsonOptions("""{"paths": ["s3://bucket/raw-folder"], "recurse": true, "groupFiles": "inPartition", "useS3ListImplementation": true}""")
    ).getDynamicFrame()

    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://bucket/partition-folder"}"""),
      format = "parquet",
      formatOptions = JsonOptions("""{"compression": "snappy", "blockSize": 268435456, "pageSize": 1048576, "useGlueParquetWriter": true}""")
    ).writeDynamicFrame(dynamicFrame.repartition(10))
  }
}
```
Best
I ran a crawler to load all my S3 CSV files into the Glue Data Catalog. Now I want to create a Glue job to execute ETL (create and drop temporary tables, select and insert data into tables in the Data Catalog). But in the Glue job (a Python shell job) I have to split my SQL statements and execute them one by one. With the following code, I get an error.
```
client = boto3.client('athena')
client.start_query_execution(
    QueryString="""drop table if exists database1.temptable1 ;CREATE EXTERNAL TABLE IF NOT EXISTS temptable1(id int ) """,
    ResultConfiguration=config,
)
```
Is there any way to run multiple SQL statements in a Glue job?
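For clarity, this is the kind of statement-by-statement loop I understand I would otherwise need, since Athena's StartQueryExecution accepts only a single statement per call (a rough sketch reusing the `config` object above):
```
import time
import boto3

athena = boto3.client("athena")

sql_script = """
drop table if exists database1.temptable1;
CREATE EXTERNAL TABLE IF NOT EXISTS temptable1 (id int)
"""

# Split on ';' and run each statement on its own, waiting for it to finish.
for statement in [s.strip() for s in sql_script.split(";") if s.strip()]:
    qid = athena.start_query_execution(
        QueryString=statement,
        ResultConfiguration=config,
    )["QueryExecutionId"]

    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
```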
Data layer is not my thing and I need some guidance.
I created a Glue crawler to extract compressed JSON files stored in an AWS S3 bucket. I recently learned that I can use Athena to connect directly to the Glue database. When I do `select * from *table-name*` it starts to load but then errors with a long string of stuff:
HIVE_METASTORE_ERROR: Error: : expected at the position 407 of 'struct<http:struct<status_code:int,url_details:struct<path:string,queryString:struct<prefix:string,versioning:string,logging:string,encoding-type:string,nodes:string,policy:string,acl:string,policyStatus:string,replication:string,notification:string,tagging:string,website:string,encryption:string,size:string,limit:string,hash:string,accelerate:string,publicAccessBlock:string,code:string,protocol:string,G%EF%BF%BD%EF%BF%BD%EF%BF%BD\,%EF%BF%BD%EF%BF%BD%EF%BF%BD`~%EF%BF%BD%00%EF%BF%BD%EF%BF%BD{%EF%BF%BD%D5%96%EF%BF%BDw%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3C:string,cors:string,object- etc etc etc.
I can load one small table but the others fail.
I can't find a proper way to set the correct data type for a timestamp attribute on my Athena table (**parquet**) in order to query for time intervals.
I'm creating the table via a crawler on parquet files produced by a Glue interactive job (PySpark).
**Below are my Athena table and the timestamp formats I'm trying:**
1. ts (datatype **timestamp**) : 2023-02-07 23:59:59.460000
2. ts_iso_stamp (datatype **timestamp**) : 2023-02-07 23:59:59.460000
3. ts_iso_str (datatype **timestamp**) : "2023-02-07T23:59:59.460000+00:00"
**I'm applying this mapping in my Glue job to create the various formats:**
```
ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("date", "date", "date", "date"),
        ("ts", "string", "ts", "timestamp"),
        ("ts_iso", "string", "ts_iso_stamp", "timestamp"),
        ("ts_iso", "string", "ts_iso_str", "string"),
    ],
)
```
**ts_iso looks like this when you run .show() on the DynamicFrame** (a valid ISO format, as far as I understand):
```
"ts_iso": "2023-02-07T23:59:59.460000+00:00"
```

Query examples failing with the same error:
```
select * from searchdb.queries_lu where appid = 'aHs7sPcX66ytv2FqmRnv'
AND ts >= TIMESTAMP '2023-02-07 23:59:59.405000'
AND ts <= TIMESTAMP '2023-02-08 00:00:00.637000';
select * from searchdb.queries_lu where appid = 'aHs7sPcX66ytv2FqmRnv'
AND ts_iso_stamp >= TIMESTAMP '2023-02-07 23:59:59.405000'
AND ts_iso_stamp <= TIMESTAMP '2023-02-08 00:00:00.637000';
select * from searchdb.queries_lu where appid = 'aHs7sPcX66ytv2FqmRnv'
AND ts_iso_str >= TIMESTAMP '2023-02-07T23:59:59.405000+00:00'
AND ts_iso_str <= TIMESTAMP '2023-02-08T00:00:00.637000+00:00';
```
Athena DDL:
```
create table searchdb.queries_ac
(
    ts            timestamp(23, 3),
    ts_iso_stamp  timestamp(23, 3),
    ts_iso_str    string(255),
    endpoint      string(255),
    appid         string(255),
    requestid     string(255),
    "in"          string(255),
    q             string(255),
    qq            string(255),
    results_id    string(255),
    results_count string(255),
    requestline   string(255),
    "date"        string(255)
);
```
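Would casting the ISO string to a real timestamp inside the Glue job, roughly as in the sketch below, be a better approach than relying on the mapping alone? (This assumes my DynamicFrame `dyf`, `glueContext`, and the column names from the mapping above; it is not code I have in the job today.)
```
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F

# Convert to a Spark DataFrame, cast the ISO-8601 string to a timestamp,
# then convert back to a DynamicFrame before writing the parquet files.
df = dyf.toDF()
df = df.withColumn("ts_iso_stamp", F.col("ts_iso").cast("timestamp"))

dyf_casted = DynamicFrame.fromDF(df, glueContext, "dyf_casted")
```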
My requirement: I want to query an Iceberg table present in another AWS account.
Let's say I am a user of account A and want to query account B's Iceberg tables present in that account's Glue catalog. I followed the steps from the AWS docs on [glue cross account](https://docs.aws.amazon.com/athena/latest/ug/security-iam-cross-account-glue-catalog-access.html) and [s3 cross account](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html) access
to attach permissions to account B's Glue catalog and the S3 bucket where the data is stored. The policies contain the required permissions with principal `account A:root`, meaning any user from account A should be able to query. I then also attached the Glue and S3 policies to account A's user.
Next I went to Athena and created a data source of type Glue with the catalog ID set to account B's account ID, and I am able to see all of account B's Glue databases and tables.
But when I query the table, e.g. `select * from table`, it gives the error `HIVE_METASTORE_ERROR: Table storage descriptor is missing SerDe info`.
I am able to query the table properly in account B, although its SerDeInfo is indeed empty.
The [Glue JDBC Connection documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc) states:
> If you already have a JDBC connection defined, you can reuse the configuration properties defined in it, such as: url, user and password; so you don't have to specify them in the code as connection options. To do so, use the following connection properties:
>* "useConnectionProperties": Set it to "true" to indicate you want to use the configuration from a connection.
>* "connectionName": Enter the connection name to retrieve the configuration from, the connection must be defined in the same region as the job.
There is no further documentation of how to use these properties. I have tried to set these properties as kwargs to `glueContext.create_dynamic_frame.from_options()`, but the method continues to throw an error if `url` is not specified in `connectionOptions`:
```
dbtable = 'schema.table'
query = f"select top 1 * from {dbtable}"

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    useConnectionProperties="true",
    connectionName="My-Glue-Connection",
    connection_options={
        "dbtable": dbtable,
        "sampleQuery": query,
    },
)

Py4JJavaError: An error occurred while calling o212.getDynamicFrame.
: java.util.NoSuchElementException: key not found: url
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at com.amazonaws.services.glue.util.JDBCWrapper$.apply(JDBCUtils.scala:913)
at com.amazonaws.services.glue.util.JDBCWrapper$.apply(JDBCUtils.scala:909)
at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:943)
at com.amazonaws.services.glue.DataSource$class.getDynamicFrame(DataSource.scala:97)
at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:709)
...
```
The same error occurs if I pass the two properties via `connection_options`:
```
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "IDAP-Glue-Connection",
        "dbtable": dbtable,
        "sampleQuery": query,
    },
)

Py4JJavaError: An error occurred while calling o123.getDynamicFrame.
: java.util.NoSuchElementException: key not found: url
...
```
How is this feature intended to be used? The only methods I've found to use a Glue Connection to read from a JDBC database are really unwieldy; I would expect tighter integration of Glue Connections in Glue jobs.
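For example, the kind of unwieldy workaround I mean is pulling the settings out with `glueContext.extract_jdbc_conf` and passing them explicitly (a rough sketch; the connection and table names are placeholders, and the exact keys returned may vary by connection type):
```
# Pull url/user/password from the Glue Connection and pass them explicitly.
jdbc_conf = glueContext.extract_jdbc_conf("My-Glue-Connection")

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        "url": jdbc_conf["url"],  # may need the database appended, e.g. ";databaseName=mydb"
        "user": jdbc_conf["user"],
        "password": jdbc_conf["password"],
        "dbtable": "schema.table",
    },
)
```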