EMR Serverless not populating AWS Glue Catalog

I want to use the AWS Glue Data Catalog as a metastore. I'm running an EMR Serverless job that inserts and updates data in a Delta table. I've successfully populated Delta tables on my localhost machine, and now I'm trying to populate the AWS Glue Data Catalog through my EMR Serverless job. The job currently runs without error; the only problem is that the Glue Data Catalog is not getting populated.

I've followed the instructions here.

I start my EMR Serverless job via the AWS CLI. I add the following Spark parameter configuration as directed in the above documentation:

--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
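For context, a job like this is typically launched with `aws emr-serverless start-job-run`, passing the flag through `sparkSubmitParameters`. A minimal sketch of such a submission (the application ID, execution role ARN, and S3 paths below are placeholders, not values from this post):

```shell
# Hypothetical submission; application ID, role ARN, and S3 paths are placeholders.
aws emr-serverless start-job-run \
  --application-id "<application-id>" \
  --execution-role-arn "arn:aws:iam::111122223333:role/<job-execution-role>" \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<bucket>/scripts/delta_job.py",
      "sparkSubmitParameters": "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }'
```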

I've also added glue:* permissions to the role that executes my EMR Serverless job. I've checked the AWS Glue console, but I don't see the table under Data Catalog tables. The Spark driver logs for the job (specifically the standard error logs) don't show anything related to Glue. The only Hive-related log line I see is:

INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.

which doesn't look too promising.

However, the EMR Serverless console page for the job shows that it recognizes the AWS Glue Data Catalog as the metastore in the Metastore configuration section.

So am I doing something wrong or missing something?

Asked 25 days ago · Viewed 119 times
2 answers

Hello,

Since you are updating a Delta table that uses the Glue catalog, could you test the sample below and let me know the outcome?

  1. Configure the Spark session.

Set up the Spark SQL extensions to use Delta Lake.

%%configure -f
{
    "conf": {
        "spark.sql.extensions" : "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "spark.jars": "/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
}

  2. Create a Delta Lake table

We will create a Spark DataFrame with sample data and write it into a Delta Lake table. NOTE: You will need to update my_bucket in the code below to your own bucket. Please make sure you have read and write permissions for this bucket.

tableName = "delta_table"
basePath = "s3://my_bucket/aws_workshop/delta_data_location/" + tableName

data = spark.createDataFrame([
 ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
 ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
 ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
 ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
],["id", "creation_date", "last_update_time"])

data.write.format("delta").save(basePath)

  3. Query the table

We will read the table into a Spark DataFrame using spark.read:

df = spark.read.format("delta").load(basePath)
df.show()
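One detail worth flagging in this sample: `save(basePath)` writes Delta files to the path but does not register anything in a metastore, so the Glue Data Catalog would stay empty even when the write succeeds. To get a catalog entry, the table has to be written as a catalog table. A sketch continuing the sample above (`my_database` is a placeholder and must already exist in the catalog):

```python
# save(path) only writes data files; saveAsTable additionally records the
# table in the configured metastore (the Glue Data Catalog, when the Glue
# client factory class is set).
data.write.format("delta") \
    .option("path", basePath) \
    .saveAsTable("my_database.delta_table")
```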
AWS
Support Engineer
Answered 25 days ago
  • Hi Yokesh,

    Thanks for the reply. Let me clarify my issue.

    In the past, I have been able to save and load Delta tables on EMR Serverless and on localhost. My issue is that when I added the AWS Glue Data Catalog as a metastore (by specifying the Spark configuration parameters), the Data Catalog tables are not populated in AWS Glue, even though the EMR Serverless job still runs fine.

    The suggestions above work for me; my code is very similar to this. But again, the Glue Data Catalog is not updated.

  • Hi there, that looks strange :-) If EMR Serverless is able to write and read the Delta table without any issues, then the metadata should be persisted in the Glue catalog. Just to confirm, in case you haven't verified it already: please make sure the database the table refers to exists and that you have the appropriate permissions to list it. You can run the commands below to check whether they are visible:

    spark.sql('show databases').show()
    spark.sql('show tables from <Your database>').show()
    

    If they are visible there, the issue likely lies in how you reference the object. If not, you can try enabling debug logging on your Spark job and make sure it is writing to and reading from the appropriate table. If you still see the issue after checking the pointers above, please feel free to reach out via AWS Support for more assistance.
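The debug-logging suggestion above can be applied from inside the job itself, e.g.:

```python
# Raise driver log verbosity so metastore activity appears in the stderr logs.
spark.sparkContext.setLogLevel("DEBUG")
```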

Accepted Answer

I resolved the issue. Unfortunately, the AWS documentation is missing a configuration setting. The metastore configuration documentation is here:

https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/metastore-config.html#glue-metastore

In the Spark tab, the documentation is missing the following configuration setting. Without it, Spark defaults to its in-memory catalog implementation instead of Hive, so nothing reaches the Glue Data Catalog:

--conf spark.sql.catalogImplementation=hive
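For reference, the full set of submit parameters would then look something like this sketch (the Delta-related settings are the ones shown in the earlier answer):

```shell
--conf spark.sql.catalogImplementation=hive \
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
```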

Combined with the other documented settings, I now see the Delta table show up in the AWS Glue Data Catalog. Glue seems to have trouble correctly parsing the schema, but that is a question for another post.

Please update the AWS EMR Serverless documentation to include this configuration setting. Thanks!

Answered 24 days ago
AWS
Support Engineer
Reviewed 23 days ago
  • That's a good catch. catalogImplementation should not be "in-memory" in this case.
