Apache Hudi on Amazon EMR and AWS Database


I completed the tutorial from this link successfully, but when I try to do the same thing with other data and another table it does not work. I receive this error:

[hadoop@ip-10-99-2-111 bin]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.5 \
  --master yarn --deploy-mode cluster \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  /usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field dms_received_ts \
  --props s3://hudi-test-tt/properties/dfs-source-health-care-full.properties \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://hudi-test-tt/hudi/health_care \
  --target-table hudiblogdb.health_care \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --enable-hive-sync
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hudi#hudi-utilities-bundle_2.11 added as a dependency
org.apache.spark#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-fbc63aec-b48f-4ef4-bc38-f788919cf31c;1.0
        confs: [default]
        found org.apache.hudi#hudi-utilities-bundle_2.11;0.5.2-incubating in central
        found org.apache.spark#spark-avro_2.11;2.4.5 in central
        found org.spark-project.spark#unused;1.0.0 in central
:: resolution report :: resolve 270ms :: artifacts dl 7ms
        :: modules in use:
        org.apache.hudi#hudi-utilities-bundle_2.11;0.5.2-incubating from central in [default]
        org.apache.spark#spark-avro_2.11;2.4.5 from central in [default]
        org.spark-project.spark#unused;1.0.0 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-fbc63aec-b48f-4ef4-bc38-f788919cf31c
        confs: [default]
        0 artifacts copied, 3 already retrieved (0kB/7ms)
22/08/25 21:39:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/08/25 21:39:38 INFO RMProxy: Connecting to ResourceManager at ip-10-99-2-111.us-east-2.compute.internal/10.99.2.111:8032
22/08/25 21:39:38 INFO Client: Requesting a new application from cluster with 1 NodeManagers
22/08/25 21:39:38 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
22/08/25 21:39:38 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
22/08/25 21:39:38 INFO Client: Setting up container launch context for our AM
22/08/25 21:39:38 INFO Client: Setting up the launch environment for our AM container
22/08/25 21:39:39 INFO Client: Preparing resources for our AM container
22/08/25 21:39:39 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/08/25 21:39:41 INFO Client: Uploading resource file:/mnt/tmp/spark-4c327077-6693-4371-9e41-10e2342e0200/__spark_libs__5969710364624957851.zip -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/__spark_libs__5969710364624957851.zip
22/08/25 21:39:41 INFO Client: Uploading resource file:/usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/hudi-utilities-bundle_2.11-0.5.2-incubating.jar
22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.hudi_hudi-utilities-bundle_2.11-0.5.2-incubating.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.apache.hudi_hudi-utilities-bundle_2.11-0.5.2-incubating.jar
22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.spark_spark-avro_2.11-2.4.5.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.apache.spark_spark-avro_2.11-2.4.5.jar
22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.spark-project.spark_unused-1.0.0.jar
22/08/25 21:39:41 INFO Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/hive-site.xml
22/08/25 21:39:42 INFO Client: Uploading resource file:/mnt/tmp/spark-4c327077-6693-4371-9e41-10e2342e0200/__spark_conf__6985991088000323368.zip -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/__spark_conf__.zip
22/08/25 21:39:42 INFO SecurityManager: Changing view acls to: hadoop
22/08/25 21:39:42 INFO SecurityManager: Changing modify acls to: hadoop
22/08/25 21:39:42 INFO SecurityManager: Changing view acls groups to:
22/08/25 21:39:42 INFO SecurityManager: Changing modify acls groups to:
22/08/25 21:39:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
22/08/25 21:39:43 INFO Client: Submitting application application_1661296163923_0003 to ResourceManager
22/08/25 21:39:43 INFO YarnClientImpl: Submitted application application_1661296163923_0003
22/08/25 21:39:44 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:44 INFO Client:
         client token: N/A
         diagnostics: AM container is launched, waiting for AM container to Register with RM
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1661463583358
         final status: UNDEFINED
         tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
         user: hadoop
22/08/25 21:39:45 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:46 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:47 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:48 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:49 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:50 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:50 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal
         ApplicationMaster RPC port: 33179
         queue: default
         start time: 1661463583358
         final status: UNDEFINED
         tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
         user: hadoop
22/08/25 21:39:51 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:52 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:53 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:54 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:55 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:55 INFO Client:
         client token: N/A
         diagnostics: AM container is launched, waiting for AM container to Register with RM
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1661463583358
         final status: UNDEFINED
         tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
         user: hadoop
22/08/25 21:39:56 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:57 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:58 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:59 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:40:00 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:40:00 INFO Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal
         ApplicationMaster RPC port: 34591
         queue: default
         start time: 1661463583358
         final status: UNDEFINED
         tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
         user: hadoop
22/08/25 21:40:01 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:40:02 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:40:03 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:40:04 INFO Client: Application report for application_1661296163923_0003 (state: FINISHED)
22/08/25 21:40:04 INFO Client:
         client token: N/A
         diagnostics: User class threw exception: java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
        at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:101)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:364)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:95)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:89)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class
        at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80)
        at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89)
        at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:99)
        ... 9 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:78)
        ... 11 more
Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Required property hoodie.deltastreamer.schemaprovider.source.schema.file is missing
        at org.apache.hudi.DataSourceUtils.lambda$checkRequiredProperties$1(DataSourceUtils.java:173)
        at java.util.Collections$SingletonList.forEach(Collections.java:4824)
        at org.apache.hudi.DataSourceUtils.checkRequiredProperties(DataSourceUtils.java:171)
        at org.apache.hudi.utilities.schema.FilebasedSchemaProvider.<init>(FilebasedSchemaProvider.java:55)
        ... 16 more

         ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal
         ApplicationMaster RPC port: 34591
         queue: default
         start time: 1661463583358
         final status: FAILED
         tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
         user: hadoop
22/08/25 21:40:04 ERROR Client: Application diagnostics message: User class threw exception: java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
        at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:101)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:364)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:95)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:89)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class
        at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80)
        at org.apache.hudi.common.util.ReflectionUtils.loadClass(Refl
• To do this I already updated the S3 files (the schema and properties files) for the new data, and uploaded the new data to S3 correctly. Do I need to do anything else for this to work?
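
A quick way to double-check which objects the job actually reads (paths taken from the spark-submit command above; `aws s3 cp ... -` streams the object to stdout):

aws s3 ls s3://hudi-test-tt/properties/
aws s3 cp s3://hudi-test-tt/properties/dfs-source-health-care-full.properties - | grep schemaprovider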

Asked 2 years ago, 538 views
1 Answer

Hi,

You need to set the property hoodie.deltastreamer.schemaprovider.source.schema.file, as the log trace shows: the root cause is "HoodieNotSupportedException: Required property hoodie.deltastreamer.schemaprovider.source.schema.file is missing".
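
With --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider, the file passed via --props must point at an Avro schema file. A minimal sketch of the relevant entries, assuming paths under your s3://hudi-test-tt bucket (the object keys and field names below are placeholders for your data, not the tutorial's exact values):

# Required by FilebasedSchemaProvider: Avro schema of the incoming records
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://hudi-test-tt/schema/health_care.avsc
# Optional: target table schema; if omitted, Hudi falls back to the source schema
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://hudi-test-tt/schema/health_care.avsc
# Input directory read by ParquetDFSSource
hoodie.deltastreamer.source.dfs.root=s3://hudi-test-tt/raw/health_care
# Record key and partition path for the Hudi table
hoodie.datasource.write.recordkey.field=<your record key column>
hoodie.datasource.write.partitionpath.field=<your partition column>
# Query applied by SqlQueryBasedTransformer; <SRC> stands for the incoming batch
hoodie.deltastreamer.transformer.sql=SELECT * FROM <SRC>

The schema file itself is plain Avro JSON, for example (illustrative fields only):

{
  "type": "record",
  "name": "health_care",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "dms_received_ts", "type": "string"}
  ]
}

If the property is already present in your file, make sure the S3 path given to --props is exactly the object you edited, since DeltaStreamer loads the properties from that location at startup.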

Let me know if you succeed!

AWS
Answered 2 years ago
