Apache Hudi on Amazon EMR and AWS Database


I completed the tutorial from this link successfully, but when I try to do the same thing with different data and a different table, it fails. I receive this error:

hadoop@ip-10-99-2-111 bin]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.5 \
  --master yarn --deploy-mode cluster \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  /usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field dms_received_ts \
  --props s3://hudi-test-tt/properties/dfs-source-health-care-full.properties \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://hudi-test-tt/hudi/health_care \
  --target-table hudiblogdb.health_care \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --enable-hive-sync
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hudi#hudi-utilities-bundle_2.11 added as a dependency
org.apache.spark#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-fbc63aec-b48f-4ef4-bc38-f788919cf31c;1.0
        confs: [default]
        found org.apache.hudi#hudi-utilities-bundle_2.11;0.5.2-incubating in central
        found org.apache.spark#spark-avro_2.11;2.4.5 in central
        found org.spark-project.spark#unused;1.0.0 in central
:: resolution report :: resolve 270ms :: artifacts dl 7ms
        :: modules in use:
        org.apache.hudi#hudi-utilities-bundle_2.11;0.5.2-incubating from central in [default]
        org.apache.spark#spark-avro_2.11;2.4.5 from central in [default]
        org.spark-project.spark#unused;1.0.0 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-fbc63aec-b48f-4ef4-bc38-f788919cf31c
        confs: [default]
        0 artifacts copied, 3 already retrieved (0kB/7ms)
22/08/25 21:39:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/08/25 21:39:38 INFO RMProxy: Connecting to ResourceManager at ip-10-99-2-111.us-east-2.compute.internal/10.99.2.111:8032
22/08/25 21:39:38 INFO Client: Requesting a new application from cluster with 1 NodeManagers
22/08/25 21:39:38 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
22/08/25 21:39:38 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
22/08/25 21:39:38 INFO Client: Setting up container launch context for our AM
22/08/25 21:39:38 INFO Client: Setting up the launch environment for our AM container
22/08/25 21:39:39 INFO Client: Preparing resources for our AM container
22/08/25 21:39:39 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/08/25 21:39:41 INFO Client: Uploading resource file:/mnt/tmp/spark-4c327077-6693-4371-9e41-10e2342e0200/__spark_libs__5969710364624957851.zip -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/__spark_libs__5969710364624957851.zip
22/08/25 21:39:41 INFO Client: Uploading resource file:/usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/hudi-utilities-bundle_2.11-0.5.2-incubating.jar
22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.hudi_hudi-utilities-bundle_2.11-0.5.2-incubating.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.apache.hudi_hudi-utilities-bundle_2.11-0.5.2-incubating.jar
22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.spark_spark-avro_2.11-2.4.5.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.apache.spark_spark-avro_2.11-2.4.5.jar
22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.spark-project.spark_unused-1.0.0.jar
22/08/25 21:39:41 INFO Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/hive-site.xml
22/08/25 21:39:42 INFO Client: Uploading resource file:/mnt/tmp/spark-4c327077-6693-4371-9e41-10e2342e0200/__spark_conf__6985991088000323368.zip -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/__spark_conf__.zip
22/08/25 21:39:42 INFO SecurityManager: Changing view acls to: hadoop
22/08/25 21:39:42 INFO SecurityManager: Changing modify acls to: hadoop
22/08/25 21:39:42 INFO SecurityManager: Changing view acls groups to:
22/08/25 21:39:42 INFO SecurityManager: Changing modify acls groups to:
22/08/25 21:39:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
22/08/25 21:39:43 INFO Client: Submitting application application_1661296163923_0003 to ResourceManager
22/08/25 21:39:43 INFO YarnClientImpl: Submitted application application_1661296163923_0003
22/08/25 21:39:44 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:44 INFO Client:
         client token: N/A
         diagnostics: AM container is launched, waiting for AM container to Register with RM
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1661463583358
         final status: UNDEFINED
         tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
         user: hadoop
22/08/25 21:39:45 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:46 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:47 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:48 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:49 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:50 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:50 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal
         ApplicationMaster RPC port: 33179
         queue: default
         start time: 1661463583358
         final status: UNDEFINED
         tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
         user: hadoop
22/08/25 21:39:51 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:52 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:53 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:54 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:55 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:55 INFO Client:
         client token: N/A
         diagnostics: AM container is launched, waiting for AM container to Register with RM
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1661463583358
         final status: UNDEFINED
         tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
         user: hadoop
22/08/25 21:39:56 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:57 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:58 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:59 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:40:00 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:40:00 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal
         ApplicationMaster RPC port: 34591
         queue: default
         start time: 1661463583358
         final status: UNDEFINED
         tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
         user: hadoop
22/08/25 21:40:01 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:40:02 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:40:03 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:40:04 INFO Client: Application report for application_1661296163923_0003 (state: FINISHED)
22/08/25 21:40:04 INFO Client:
         client token: N/A
         diagnostics: User class threw exception: java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
        at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:101)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:364)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:95)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:89)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class
        at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80)
        at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89)
        at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:99)
        ... 9 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:78)
        ... 11 more
Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Required property hoodie.deltastreamer.schemaprovider.source.schema.file is missing
        at org.apache.hudi.DataSourceUtils.lambda$checkRequiredProperties$1(DataSourceUtils.java:173)
        at java.util.Collections$SingletonList.forEach(Collections.java:4824)
        at org.apache.hudi.DataSourceUtils.checkRequiredProperties(DataSourceUtils.java:171)
        at org.apache.hudi.utilities.schema.FilebasedSchemaProvider.<init>(FilebasedSchemaProvider.java:55)
        ... 16 more

         ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal
         ApplicationMaster RPC port: 34591
         queue: default
         start time: 1661463583358
         final status: FAILED
         tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
         user: hadoop
22/08/25 21:40:04 ERROR Client: Application diagnostics message: User class threw exception: java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
        at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:101)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:364)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:95)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:89)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class
        at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80)
        at org.apache.hudi.common.util.ReflectionUtils.loadClass(Refl
To make this work, I already updated the files on S3: I changed the schema and properties files to match the new data and uploaded the new data itself to S3. Do I need to do anything else for this to work?

asked 2 years ago · 527 views
1 Answer

Hi,

You need to set the property hoodie.deltastreamer.schemaprovider.source.schema.file, as the stack trace indicates (HoodieNotSupportedException: Required property hoodie.deltastreamer.schemaprovider.source.schema.file is missing). FilebasedSchemaProvider reads this property from the file you pass via --props, so it must be present in dfs-source-health-care-full.properties.
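
For reference, here is a minimal sketch of what that properties file could contain for FilebasedSchemaProvider. The S3 paths and file names below are illustrative assumptions; replace them with the actual locations of your Avro schema files:

    # Illustrative paths - replace with your own schema file locations on S3
    hoodie.deltastreamer.schemaprovider.source.schema.file=s3://hudi-test-tt/schema/source-health-care.avsc
    hoodie.deltastreamer.schemaprovider.target.schema.file=s3://hudi-test-tt/schema/target-health-care.avsc

If the target schema property is omitted, Hudi falls back to the source schema. Each property should point to a plain Avro schema in JSON form, for example (the record and field names here are assumptions, except dms_received_ts, which comes from the --source-ordering-field in your command):

    {
      "type": "record",
      "name": "health_care",
      "fields": [
        {"name": "id", "type": "int"},
        {"name": "dms_received_ts", "type": "string"}
      ]
    }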

Let me know if you succeed!

AWS
answered 2 years ago
