UNCLASSIFIED_ERROR when running a Glue Visual ETL job


I'm running a Visual ETL job in the Glue service. I'm testing the service through the visual editor: I started with a data source pointing to a DynamoDB table (I first created a crawler, ran it, and could then see the table in the Glue catalog). My goal is to move records from DynamoDB to S3, keeping only specific fields (not all the fields in DynamoDB). The pipeline is as follows: Datasource -> Transform (drop some fields) -> Transform (execute a SQL statement) -> Data target S3 (save the records to S3 in JSON format with GZIP compression). On the first step, when looking at the data preview, the following error appears: "Data preview failure." When I run the ETL job, it stops with the following error: "Error Category: UNCLASSIFIED_ERROR; An error occurred while calling o98.getCatalogSource. line 1:549 token recognition error at: 'é'". Can someone help me understand what the specific error is? I know there is a special character, but I thought that if I dropped some fields in step #2 the error would disappear. How can I find the specific record in DynamoDB that contains this special character?

The stack is as follow:

Traceback (most recent call last):

File "/opt/amazon/lib/python3.10/site-packages/awsglue/dynamicframe.py", line 632, in from_catalog return self._glue_context.create_dynamic_frame_from_catalog(db, table_name, redshift_tmp_dir, transformation_ctx, push_down_predicate, additional_options, catalog_id, **kwargs)

File "/opt/amazon/lib/python3.10/site-packages/awsglue/context.py", line 192, in create_dynamic_frame_from_catalog source = DataSource(self._ssql_ctx.getCatalogSource(db, table_name, redshift_tmp_dir, transformation_ctx,

File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ return_value = get_return_value(

File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco return f(*a, **kw)

File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value raise Py4JJavaError(


Py4JJavaError - An error occurred while calling o97.getCatalogSource. : 
org.antlr.v4.runtime.misc.ParseCancellationException: line 1:549 token recognition error at: 'é' at com.amazonaws.services.glue.schema.io.ThrowingErrorListener.syntaxError(ThrowingErrorListener.java:15)
at org.antlr.v4.runtime.ProxyErrorListener.syntaxError(ProxyErrorListener.java:41)
at org.antlr.v4.runtime.Lexer.notifyListeners(Lexer.java:364)
at org.antlr.v4.runtime.Lexer.nextToken(Lexer.java:144)
at org.antlr.v4.runtime.BufferedTokenStream.fetch(BufferedTokenStream.java:169)
at org.antlr.v4.runtime.BufferedTokenStream.sync(BufferedTokenStream.java:152)
at org.antlr.v4.runtime.BufferedTokenStream.consume(BufferedTokenStream.java:136)
at org.antlr.v4.runtime.Parser.consume(Parser.java:571)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.forgivingIdentifier(HiveSchemaParser.java:1128)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.colType(HiveSchemaParser.java:937)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.colTypeList(HiveSchemaParser.java:693)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.structType(HiveSchemaParser.java:475)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.dataType(HiveSchemaParser.java:175)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.colType(HiveSchemaParser.java:941)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.colTypeList(HiveSchemaParser.java:693)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.structType(HiveSchemaParser.java:475)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.dataType(HiveSchemaParser.java:175)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.colType(HiveSchemaParser.java:941)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.colTypeList(HiveSchemaParser.java:693)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.structType(HiveSchemaParser.java:475)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.dataType(HiveSchemaParser.java:175)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.colType(HiveSchemaParser.java:941)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.colTypeList(HiveSchemaParser.java:683)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.structType(HiveSchemaParser.java:475)
at com.amazonaws.services.glue.schema.io.grammar.HiveSchemaParser.dataType(HiveSchemaParser.java:175)
at com.amazonaws.services.glue.schema.io.HiveFormatDeserializer.deserializeDataType(HiveFormatDeserializer.java:52)
at com.amazonaws.services.glue.schema.io.HiveFormatDeserializer.deserializeDataTypeFromString(HiveFormatDeserializer.java:63)
at com.amazonaws.services.glue.util.DataCatalogWrapperUtils.$anonfun$getFieldsFromColumns$1(DataCatalogWrapper.scala:541)
at scala.collection.immutable.List.map(List.scala:297)
at com.amazonaws.services.glue.util.DataCatalogWrapperUtils.getFieldsFromColumns(DataCatalogWrapper.scala:540)
at com.amazonaws.services.glue.util.DataCatalogWrapperUtils.getFieldsFromColumns$(DataCatalogWrapper.scala:540)
at com.amazonaws.services.glue.util.DataCatalogWrapper.getFieldsFromColumns(DataCatalogWrapper.scala:166)
at com.amazonaws.services.glue.util.DataCatalogWrapperUtils.getSchema(DataCatalogWrapper.scala:545)
at com.amazonaws.services.glue.util.DataCatalogWrapperUtils.getSchema$(DataCatalogWrapper.scala:543)
at com.amazonaws.services.glue.util.DataCatalogWrapper.getSchema(DataCatalogWrapper.scala:166)
at com.amazonaws.services.glue.util.DataCatalogWrapperUtils.catalogTableFromGlueTable(DataCatalogWrapper.scala:940)
at com.amazonaws.services.glue.util.DataCatalogWrapperUtils.catalogTableFromGlueTable$(DataCatalogWrapper.scala:903)
at com.amazonaws.services.glue.util.DataCatalogWrapper.catalogTableFromGlueTable(DataCatalogWrapper.scala:166)
at com.amazonaws.services.glue.util.DataCatalogWrapper.$anonfun$getTable$1(DataCatalogWrapper.scala:216)
at scala.util.Try$.apply(Try.scala:213)
at com.amazonaws.services.glue.util.DataCatalogWrapper.getTable(DataCatalogWrapper.scala:170)
at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:279)
at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:260)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)



cfabres
asked 2 months ago · 558 views
2 Answers
2

Hello,

Thank you very much for your question. Based on the information provided, the error is caused by a special character (in this case, 'é') that Glue encountered via your DynamoDB table. Glue is unable to parse this character, which causes the failure both during the data preview and during the ETL job run.

In order to understand the error message better, you can read the following information:

  1. "Data preview failure.": This error occurs when Glue is unable to generate a preview of the data from the source (DynamoDB table in your case). This could be due to various reasons, such as data format issues, special characters, or other compatibility problems.

  2. "Error Category: UNCLASSIFIED_ERROR; An error occurred while calling o98.getCatalogSource. line 1:549 token recognition error at: 'é'": This error message indicates that Glue encountered an issue while trying to read or process the data from the DynamoDB table. Specifically, it was unable to recognize the token (character) 'é' at line 1, position 549 of the data.

To address this issue and solve it, you have a few options:

  1. Identify the specific record with the special character: Since you're working with a DynamoDB table, you can use the AWS Management Console, AWS CLI, or an SDK to scan the table and identify the record(s) containing the special character 'é'. Once you have the record(s), you can either remove or replace the special character(s) in the DynamoDB table itself.

  2. Use a Transform step to clean the data: If modifying the data in the DynamoDB table is not an option, you can add a Transform step in your Glue ETL job to clean or remove the special characters from the data before writing it to S3. You can use Glue's built-in functions or write custom Scala or Python code to handle the data cleaning.
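If you go the custom-transform route, the core cleaning logic is straightforward. Below is a minimal sketch in plain Python of one possible approach (not from the original answer): decompose accented characters with Unicode NFKD normalization and drop the combining marks, so 'é' becomes 'e'. The function names are illustrative; in a Glue job you would apply the same logic to your record fields inside a custom transform.

```python
import unicodedata

def to_ascii(value):
    """Replace accented characters with their ASCII base letters
    (e.g. 'é' -> 'e'); characters with no ASCII equivalent are dropped."""
    normalized = unicodedata.normalize("NFKD", value)
    return normalized.encode("ascii", "ignore").decode("ascii")

def clean_record(record):
    """Apply to_ascii to every string value in a flat record dict,
    leaving non-string values untouched."""
    return {
        key: to_ascii(val) if isinstance(val, str) else val
        for key, val in record.items()
    }
```

For example, `clean_record({"name": "José", "age": 30})` returns `{"name": "Jose", "age": 30}`. Note that if the special character is in a field *name* rather than a value, you would need to clean the keys as well.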

In any case, to identify the specific record or records with the special character in your DynamoDB table, you can use the AWS Management Console, AWS CLI, or an SDK (e.g., Python's boto3 library). Here's an example of how you can scan the DynamoDB table using the AWS CLI:

aws dynamodb scan --table-name <your-table-name> --filter-expression "contains(#field, :value)" --expression-attribute-names '{"#field": "<field-name>"}' --expression-attribute-values '{":value": {"S": "é"}}' --output text

Replace <your-table-name> with the name of your DynamoDB table, and <field-name> with the name of the field where you suspect the special character might be present. This command will scan the table and return any records containing the character 'é' in the specified field.
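If you prefer Python, the same search can be scripted with boto3. The sketch below (an illustration, not part of the Glue job itself) pages through a full table scan and reports which string attributes of each item contain any non-ASCII character, rather than matching only 'é'. Function names are my own.

```python
def find_non_ascii_attrs(item):
    """Return the names of string attributes in a low-level DynamoDB
    item (the {"S": "..."} wire format) that contain any non-ASCII
    character, such as 'é'."""
    flagged = []
    for name, value in item.items():
        text = value.get("S") if isinstance(value, dict) else None
        if text is not None and not text.isascii():
            flagged.append(name)
    return flagged

def scan_for_special_chars(table_name):
    """Scan the whole table (with pagination) and yield
    (item, flagged_attribute_names) for every offending item."""
    import boto3  # AWS SDK for Python; requires configured credentials
    client = boto3.client("dynamodb")
    paginator = client.get_paginator("scan")
    for page in paginator.paginate(TableName=table_name):
        for item in page["Items"]:
            attrs = find_non_ascii_attrs(item)
            if attrs:
                yield item, attrs
```

Keep in mind that a full table scan reads every item and consumes read capacity accordingly, so on a large table you may want to run it against a backup or during off-peak hours.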

Once you have identified the problematic records, you can take appropriate action to clean or remove the special characters before proceeding with the ETL job.

AWS
answered 2 months ago

Thanks for your reply; the DynamoDB command helped me a lot to find the records.

cfabres
answered 2 months ago
  • Thank you for your response; I hope it helped. If possible, please accept the answer to help other users and gain visibility! :)
