What is the simplest way to extract all 26 tables from a single DynamoDB database into the AWS Glue Catalog?


My source database is DynamoDB, and it has 26 tables. I am trying to build AWS QuickSight reports on top of AWS Athena, which provides the specific views for those reports. However, I seem to be able to select only a single table when creating the Glue job, even though I can select all the tables I need when crawling the entire database from DynamoDB. What is the simplest route to get a complete extract of all tables that is queryable in Athena?

I don't want to connect the reports directly to DynamoDB, as it is a production database, and I want some separation to avoid any performance degradation caused by a poor query, etc. How can I extract these tables?

Oleg
asked 2 years ago · 975 views
2 Answers

Thanks. By the way: I am doing ETL from DynamoDB to S3. Five tables transformed OK, but on the sixth I received: "An error occurred while calling o397.pyWriteDynamicFrame. Unsupported case of DataType: com.amazonaws.services.glue.schema.types.StringType@4a2a46aa and DynamicNode: longnode." What could be the reason, and where should I check?

Here is the log:

2022-03-21 15:13:32,301 ERROR [Executor task launch worker for task 5.0 in stage 13.0 (TID 132)] util.Utils (Logging.scala:logError(94)): Aborting task
java.lang.RuntimeException: Unsupported case of DataType: com.amazonaws.services.glue.schema.types.StringType@4a4c524e and DynamicNode: longnode.
    at scala.sys.package$.error(package.scala:30)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.writeField(ParquetWriteSupport.scala:144)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.$anonfun$writeFields$2(ParquetWriteSupport.scala:82)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:322)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:82)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.$anonfun$writeFields$1$adapted(ParquetWriteSupport.scala:78)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:78)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:70)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:310)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:70)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:32)
    at org.apache.parquet.nimble.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:161)
    at org.apache.parquet.nimble.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:87)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriter.write(ParquetWriter.scala:157)
    at com.amazonaws.services.glue.writers.DynamicRecordWriter.safeWrite(DynamicRecordWriter.scala:52)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriterFactory$$anon$2.write(ParquetWriter.scala:239)
    at com.amazonaws.services.glue.writers.parquet.ParquetWriterFactory$$anon$2.write(ParquetWriter.scala:238)
    at com.amazonaws.services.glue.sinks.GlueParquetHadoopWriter.$anonfun$writeParquetNotPartitioned$1(GlueParquetHadoopWriter.scala:53)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
    at org.apache.spark.sql.glue.SparkUtility$.tryWithSafeFinallyAndFailureCallbacks(SparkUtility.scala:39)
    at com.amazonaws.services.glue.sinks.GlueParquetHadoopWriter.writeParquetNotPartitioned(GlueParquetHadoopWriter.scala:64)
    at com.amazonaws.services.glue.sinks.GlueParquetHadoopWriter.$anonfun$doParquetWrite$1(GlueParquetHadoopWriter.scala:179)
    at com.amazonaws.services.glue.sinks.GlueParquetHadoopWriter.$anonfun$doParquetWrite$1$adapted(GlueParquetHadoopWriter.scala:178)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Oleg
answered 2 years ago
  • From the error, it seems that one of the columns in the sixth table has a data type that is not supported in Parquet; could you share the schema in DynamoDB?
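If it turns out that the failing attribute is cataloged as a string while some items store it as a number, one common workaround (a hedged sketch, not a confirmed fix for this exact error) is to cast the ambiguous column to a single type with resolveChoice before the Parquet write. The database, table, and column names below are placeholders:

```python
# Hedged sketch: cast a mixed-type column to one type before writing to Parquet.
# "dynamodb_db", "sixth_table", and "problem_column" are placeholder names.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="dynamodb_db",
    table_name="sixth_table",
)
dyf = dyf.resolveChoice(specs=[("problem_column", "cast:string")])
# ... then continue with the existing write_dynamic_frame call on the resolved frame
```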


To clarify the question:

  1. you have already crawled your DynamoDB database
  2. you already have 26 tables in the Glue Catalog pointing directly to DynamoDB
  3. you do not want to access DynamoDB through Athena
  4. you want to export the data to S3 (ideally in Parquet format) using AWS Glue ETL
  5. you want to export the data from all tables in a single job.

If you are using Glue Studio, you can use multiple source nodes in a single job, then transform each table and write it out; all the steps will run sequentially, though.

If you are writing your own job in PySpark, you could use the Glue API to list the table names in the Glue Catalog, then loop over that list and reuse the same read/transform/write code for each table.
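As a minimal sketch of that loop (the catalog database name "dynamodb_db" and the bucket "my-export-bucket" are assumptions, so substitute your own values):

```python
# Minimal sketch: list all tables in one Glue Catalog database and export each
# to S3 in Parquet, reusing the same code. Names are placeholders.
import sys

import boto3
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

catalog_database = "dynamodb_db"                  # placeholder catalog database
target_prefix = "s3://my-export-bucket/exports"   # placeholder S3 prefix

# List every table registered in the catalog database.
glue_client = boto3.client("glue")
paginator = glue_client.get_paginator("get_tables")
table_names = [
    table["Name"]
    for page in paginator.paginate(DatabaseName=catalog_database)
    for table in page["TableList"]
]

# Read each table from the catalog and write it out as Parquet, one after the other.
for table_name in table_names:
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database=catalog_database, table_name=table_name
    )
    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": f"{target_prefix}/{table_name}/"},
        format="parquet",
    )

job.commit()
```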

Still, if you are not going to join the tables in the job, it would be better to parameterize the job by passing the table name, and then have a workflow or a Step Functions state machine call the job once per table with a different runtime parameter (the table name). That way you can run all the exports in parallel.
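A sketch of the parameterized variant, assuming a job argument named --table_name (the argument name, database, and bucket are illustrative choices):

```python
# Sketch: the table to export arrives as a runtime argument, so many runs of
# this one job can be started in parallel, each with a different table name.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "table_name"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="dynamodb_db",               # placeholder catalog database
    table_name=args["table_name"],
)
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": f"s3://my-export-bucket/exports/{args['table_name']}/"},
    format="parquet",
)
job.commit()
```

Each run would then be triggered with its own value, for example: aws glue start-job-run --job-name export-table --arguments '{"--table_name":"Orders"}' (job and table names here are placeholders).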

Finally, you could also hard-code the parameter and have one job per table (no need to write the same job over and over) by using a CloudFormation template to deploy the other 25 versions of the job once the script is finalized.
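As a rough illustration of the same one-job-per-table idea using boto3 instead of CloudFormation (job names, IAM role, script location, and the table list are all placeholders):

```python
# Sketch: clone the finalized script into one Glue job per table, baking the
# table name into each job's default arguments. All names are placeholders.
import boto3

glue = boto3.client("glue")
table_names = ["table_01", "table_02"]  # ... the remaining table names

for table_name in table_names:
    glue.create_job(
        Name=f"export-{table_name}",
        Role="arn:aws:iam::123456789012:role/GlueExportRole",  # placeholder role
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-export-bucket/scripts/export_table.py",
            "PythonVersion": "3",
        },
        DefaultArguments={"--table_name": table_name},
        GlueVersion="3.0",
        WorkerType="G.1X",
        NumberOfWorkers=2,
    )
```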

hope this helps

AWS
EXPERT
answered 2 years ago
