How do I enable Delta Lake in EMR Serverless?


The application works on an EMR cluster, but fails on EMR Serverless even though the base image is emr-6.15.0.

asked 5 months ago · 619 views
4 answers
Accepted Answer

Hello Muthukumar,

--packages is not required from EMR 6.9.0 onwards, because the Delta Lake JARs ship with the EMR image by default. On older versions you have to import compatible OSS Delta Lake dependencies for the session to work as expected; in your case the dependency version might not be compatible. I tried the latest version on EMR Serverless with the steps below, and it works fine.

1. Configure your Spark session

Set up the Spark SQL extensions to use Delta Lake.

%%configure -f
{
    "conf": {
        "spark.sql.extensions" : "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "spark.jars": "/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar,/usr/share/aws/delta/lib/delta-storage-s3-dynamodb.jar",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
}
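For a batch job, as in the original question, the same settings can be passed as spark-submit parameters instead of a notebook %%configure cell. Below is a minimal sketch using boto3; the application ID, role ARN, and script path are placeholders, and it assumes EMR Serverless 6.9.0+ where the Delta jars ship with the image.

import boto3

# Start an EMR Serverless job run with the built-in Delta Lake jars enabled.
# All <...> values are placeholders, not taken from the original thread.
client = boto3.client("emr-serverless")
response = client.start_job_run(
    applicationId="<your application id>",
    executionRoleArn="<your job execution role ARN>",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://<Your S3 bucket>/scripts/delta_job.py",
            "sparkSubmitParameters": (
                "--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension "
                "--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog "
                "--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,"
                "/usr/share/aws/delta/lib/delta-storage.jar"
            ),
        }
    },
)
print(response["jobRunId"])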

2. Create a Delta Lake table

We will create a Spark DataFrame with sample data and write it into a Delta Lake table.

tableName = "delta_table"
basePath = "s3://<Your S3 bucket>/test/delta/" + tableName
data = spark.createDataFrame([
 ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
 ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
 ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
 ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
],["id", "creation_date", "last_update_time"])

data.write.format("delta").save(basePath)
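As a side note (a sketch, not part of the original answer), later writes to the same path use the usual Spark save modes; for example, appending rows with the same schema:

# Append additional rows to the existing Delta table at the same path.
more = spark.createDataFrame(
    [("104", "2015-01-02", "2015-01-02T09:00:00.000000Z")],
    ["id", "creation_date", "last_update_time"])
more.write.format("delta").mode("append").save(basePath)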

3. Query the table

We will read the table into a Spark DataFrame using spark.read.

df = spark.read.format("delta").load(basePath)
df.show()

+---+-------------+--------------------+
| id|creation_date|    last_update_time|
+---+-------------+--------------------+
|102|   2015-01-01|2015-01-01T13:51:...|
|103|   2015-01-01|2015-01-01T13:51:...|
|101|   2015-01-01|2015-01-01T12:14:...|
|100|   2015-01-01|2015-01-01T13:51:...|
+---+-------------+--------------------+
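Beyond the latest snapshot, Delta Lake also supports time travel. A minimal sketch, assuming the table has at least one committed version:

# Read the table as of its first commit (version 0) using Delta time travel.
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(basePath)
df_v0.show()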
AWS SUPPORT ENGINEER
answered 5 months ago
  • Hi yokesh,

    I tried on EMR-S 7.0.0 and got the following error:

    Files file:/usr/share/aws/delta/lib/delta-core.jar from /usr/share/aws/delta/lib/delta-core.jar to /home/hadoop/delta-core.jar
    Exception in thread "main" java.nio.file.NoSuchFileException: /usr/share/aws/delta/lib/delta-core.jar

    Can you please advise? A possible cause is sketched just below.
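A possible cause, offered as an assumption rather than something confirmed in this thread: EMR 7.x ships Delta 3.x, where the OSS delta-core artifact was renamed to delta-spark, so the bundled jar names differ from EMR 6.x. A quick sketch to list what the image actually provides:

# List the Delta jars bundled with the EMR image; on EMR 7.x the jar is
# expected to be delta-spark.jar rather than delta-core.jar (assumption
# based on the delta-core -> delta-spark rename in Delta 3.x).
import glob
print(glob.glob("/usr/share/aws/delta/lib/*.jar"))

If so, spark.jars should point at the jar names that are actually present on the image.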


Hello,

I understand that your question is about enabling the Delta Lake format in EMR Serverless, which is not working for some reason. Please correct me if my understanding is incorrect.

Given that, you can follow this document to test Delta Lake in EMR Serverless - https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-delta-lake.html - and let me know if you have any queries.

AWS SUPPORT ENGINEER
answered 5 months ago
EXPERT
reviewed 5 months ago

Hi Yokesh,

I tried the above, but I am still not able to start the Spark session with Delta enabled. Below are my configurations:

--conf spark.jars=s3://<bucket name>/jars/delta-core_2.12-2.4.0.jar,s3://<bucket name>/jars/delta-storage-2.4.0.jar
--conf spark.submit.pyFiles=s3://<bucket name>/scripts/code.zip
--conf spark.jars.packages=io.delta:delta-core_2.12:2.0.0
--conf spark.archives=s3://<bucket name>/archives/pyspark_3.11.7.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python

If I use --conf spark.jars.packages=io.delta:delta-spark_2.12:3.0.0, I get the error below.

23/12/30 01:15:51 WARN SparkSession: Cannot use io.delta.sql.DeltaSparkSessionExtension to configure session extensions. java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/analysis/UnresolvedLeafNode
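For context, this looks like a Spark/Delta version mismatch (an inference, not stated in the thread): org/apache/spark/sql/catalyst/analysis/UnresolvedLeafNode only exists in Spark 3.5+, and delta-spark 3.0.0 is built against Spark 3.5, while emr-6.15.0 runs Spark 3.4. A quick sanity check of the runtime version before picking a Delta build:

# Print the runtime Spark version and pick a matching Delta build:
# delta-core 2.4.x targets Spark 3.4.x; delta-spark 3.0.x targets Spark 3.5.x.
print(spark.version)  # expected to be 3.4.x on emr-6.15.0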

answered 5 months ago

Hi Yogesh, thank you so much. It worked.

answered 5 months ago
