Thanks, by the way: I'm doing an ETL from DynamoDB to S3. Five tables transformed OK, but on the sixth I received: An error occurred while calling o397.pyWriteDynamicFrame. Unsupported case of DataType: com.amazonaws.services.glue.schema.types.StringType@4a2a46aa and DynamicNode: longnode. What could be the reason? Where should I check?
This is the log:
2022-03-21 15:13:32,301 ERROR [Executor task launch worker for task 5.0 in stage 13.0 (TID 132)] util.Utils (Logging.scala:logError(94)): Aborting task
java.lang.RuntimeException: Unsupported case of DataType: com.amazonaws.services.glue.schema.types.StringType@4a4c524e and DynamicNode: longnode.
    at scala.sys.package$.error(package.scala:30)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.writeField(ParquetWriteSupport.scala:144)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.$anonfun$writeFields$2(ParquetWriteSupport.scala:82)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:322)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:82)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.$anonfun$writeFields$1$adapted(ParquetWriteSupport.scala:78)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:78)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:70)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:310)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:70)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:32)
    at org.apache.parquet.nimble.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:161)
    at org.apache.parquet.nimble.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:87)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriter.write(ParquetWriter.scala:157)
    at com.amazonaws.services.glue.writers.DynamicRecordWriter.safeWrite(DynamicRecordWriter.scala:52)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriterFactory$$anon$2.write(ParquetWriter.scala:239)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriterFactory$$anon$2.write(ParquetWriter.scala:238)
    at com.amazonaws.services.glue.sinks.GlueParquetHadoopWriter.$anonfun$writeParquetNotPartitioned$1(GlueParquetHadoopWriter.scala:53)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
    at org.apache.spark.sql.glue.SparkUtility$.tryWithSafeFinallyAndFailureCallbacks(SparkUtility.scala:39)
    at com.amazonaws.services.glue.sinks.GlueParquetHadoopWriter.writeParquetNotPartitioned(GlueParquetHadoopWriter.scala:64)
    at com.amazonaws.services.glue.sinks.GlueParquetHadoopWriter.$anonfun$doParquetWrite$1(GlueParquetHadoopWriter.scala:179)
    at com.amazonaws.services.glue.sinks.GlueParquetHadoopWriter.$anonfun$doParquetWrite$1$adapted(GlueParquetHadoopWriter.scala:178)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
To clarify the question:
- you have already crawled your DynamoDB
- you already have 26 tables in the Glue Catalog pointing directly to DynamoDB
- you do not want to access DynamoDB through Athena
- you want to export the data to S3 (optimally in Parquet format) using AWS Glue ETL
- you want to export the data from all tables in a single job.
If you are using Glue Studio, you can actually use multiple source nodes in a single job and then transform each table and write it out, though all the steps will be executed sequentially.
If you are writing your own job in PySpark, you could use the Glue API to list the table names in the Glue Catalog, then loop over the list of names and reuse the same code, as in the sketch below.
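A minimal sketch of that loop (the database name, S3 path, and the one-Parquet-prefix-per-table layout are assumptions, not details from the post):

```python
# Sketch: export every table of a Glue Catalog database to S3 as Parquet in one job.
import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())
glue_client = boto3.client("glue")

database_name = "dynamodb_catalog_db"            # hypothetical Catalog database
target_prefix = "s3://my-export-bucket/exports"  # hypothetical S3 location

# List every table registered in the Catalog database.
paginator = glue_client.get_paginator("get_tables")
table_names = [
    t["Name"]
    for page in paginator.paginate(DatabaseName=database_name)
    for t in page["TableList"]
]

for table_name in table_names:
    # Read through the Catalog table (which points at DynamoDB)...
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database=database_name, table_name=table_name
    )
    # ...and write one Parquet dataset per table.
    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": f"{target_prefix}/{table_name}/"},
        format="parquet",
    )
```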
Still, if you are not going to join the tables in the job, it would be better to parameterize the job by passing the table name, and then have a workflow or a Step Functions state machine call the job with a different runtime parameter (the table name) for each table; that way you could run all the exports in parallel. A sketch of that variant follows.
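Roughly, the parameterized version could look like this (the argument name --table_name, the database name, and the S3 path are placeholders):

```python
# Sketch: same export, but the table is passed as a job argument (--table_name),
# so the job can be started many times in parallel, once per table.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

args = getResolvedOptions(sys.argv, ["table_name"])
glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="dynamodb_catalog_db",   # hypothetical Catalog database
    table_name=args["table_name"],
)
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": f"s3://my-export-bucket/exports/{args['table_name']}/"},
    format="parquet",
)
```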
Finally, you could also hard-code the parameter and have one job per table (no need to write the same job over and over) by using a CloudFormation template to deploy the other 25 versions of the job once the script is finalized.
Hope this helps.
From the error, it seems that one of the columns in the sixth table has a data type the Parquet writer cannot handle: the schema expects a string, but the record actually contains a long. Could you share the schema in DynamoDB?
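In the meantime, a quick way to narrow it down on the Glue side (a sketch; dyf is assumed to be the DynamicFrame read from the sixth table, and my_field is a hypothetical column name):

```python
# Print the schema Glue inferred for the problematic table; the string/long
# mismatch usually shows up as a 'choice' field or an unexpected type.
dyf.printSchema()

# One common workaround: force the ambiguous column to a single type
# before writing to Parquet.
dyf_fixed = dyf.resolveChoice(specs=[("my_field", "cast:string")])
```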