Amazon Transcribe Custom Vocabulary ERROR: invalid characters or incorrectly formatted terms

Source Link: https://github.com/aws-samples/amazon-transcribe-comprehend-podcast

Steps:

  1. Deployed the stack using the Launch Stack button in the source's README, with the default configuration.
  2. Followed the README to start execution in RSS Step Functions State Machine with the example input JSON file.

Errors:

  1. The Step Functions execution failed with the message "Fail state executed in step: Processing Error". When I investigated in Amazon Transcribe, I found that a custom vocabulary job had failed. The failure reason was: "The vocabulary that you're trying to create contains invalid characters or incorrectly formatted terms. See the developer guide for more information."

It's not clear what to do next. How can I diagnose the specific issue with the Transcribe job?

  • With @Dani Mitchells help, I found out that the problem was that Transcribe doesn't accept terms starting with a hyphen (-) when creating a custom vocabulary.

    I solved it by adding an extra check for this case in the podcast-transcribe-index-createTranscribeVocabular function:

    mapping[item] = origItem
    # Transcribe rejects vocabulary terms that start with '-', so strip it
    if item.startswith("-"):
        item = item[1:]
    vocabularyTerms.append(item)
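
The same check can be sketched as a standalone helper. This is only an illustration, not the sample repository's code: the function name `clean_vocabulary_terms` is hypothetical, and it goes slightly further than the snippet above by stripping repeated leading hyphens, skipping terms that become empty, and dropping duplicates that stripping can create (the example feed contains both -Hawn and Hawn).

```python
def clean_vocabulary_terms(terms):
    """Strip leading hyphens from each term.

    Hypothetical helper for illustration. Terms that end up empty are
    skipped, and duplicates are dropped while keeping the original order.
    """
    cleaned = []
    seen = set()
    for term in terms:
        term = term.lstrip("-")  # "-Hawn" -> "Hawn"; inner hyphens are untouched
        if term and term not in seen:
            seen.add(term)
            cleaned.append(term)
    return cleaned


if __name__ == "__main__":
    print(clean_vocabulary_terms(["-Hawn", "Cloud", "A-W-S", "Hawn"]))
    # prints ['Hawn', 'Cloud', 'A-W-S']
```

Whether duplicates need to be removed is an assumption here; dropping them simply keeps the list tidy.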
2 Answers
Accepted Answer

Hi! It looks like incorrectly formatted terms are being passed to Transcribe when the custom vocabulary is created. I checked the Lambda function that creates the vocabulary; with the example feed, it produces the following terms:

["-Hawn", "Cloud", "A-W-S-A-I-Services", "Amazon", "Code-Whisperer", "Pillir", "A-W-S", "S-A-P", "U-K-T-V", "Media-two-Cloud", "Marketplace", "Mainframe-Modernization", "E-M-R-Serverless", "E-M-R", "Apache", "Spark", "Hive", "Hawn", "Amazon-Connect", "Local-Measure", "low", "Intelligent-Automation"]

It seems that a term of the form "-word" (here, -Hawn) is not accepted when creating a custom vocabulary, and that is what causes the failure. The Lambda function that does the preprocessing should be reviewed: podcast-transcribe-index-createTranscribeVocabular***

Hope this helps!

AWS
Dani M
answered 2 years ago
  • Interesting that you found -Hawn to be the problem. According to the link provided by @iona Ekonomi, https://docs.aws.amazon.com/transcribe/latest/dg/charsets.html#char-english, hyphens are allowed in the English character set.

    Does this mean that hyphens at the beginning of a word are not acceptable? Also, could you let me know how you found out that word was the one causing the issue?

  • It looks like a hyphen is not accepted at the start of a term; I will raise this internally. My troubleshooting steps were: (1) modify the Lambda to print out the vocabulary phrases list; (2) with that list, try creating a custom vocabulary manually to see where it fails. -Hawn was the only term I could imagine making it fail, hehe, and so it was!

    (Please accept the answer if it has helped you ^_^ )
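
    The first pass of that troubleshooting can also be done locally before calling Transcribe. The sketch below is an illustration only: the helper name `find_leading_hyphen_terms` is hypothetical, and it checks just the one rule observed in this thread (a term must not start with a hyphen), using a subset of the terms listed in the answer above.

```python
def find_leading_hyphen_terms(terms):
    """Return the terms that start with a hyphen.

    Hypothetical helper: it checks only the leading-hyphen rule observed
    in this thread, not every formatting rule Transcribe may enforce.
    """
    return [term for term in terms if term.startswith("-")]


if __name__ == "__main__":
    # Subset of the phrases printed by the Lambda for the example feed.
    phrases = ["-Hawn", "Cloud", "A-W-S-A-I-Services", "Code-Whisperer",
               "E-M-R", "Hawn", "Amazon-Connect"]
    print(find_leading_hyphen_terms(phrases))
    # prints ['-Hawn']
```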

1

When you create a custom vocabulary for Transcribe, you have to check which characters are allowed for the language you are using. You can find more details here: https://docs.aws.amazon.com/transcribe/latest/dg/charsets.html.
For the specific language you are using, you can check there how to handle special characters.
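
A character-level check along these lines might look like the sketch below. The allowed set used here (ASCII letters plus hyphen, apostrophe, and period) is an assumption standing in for the English character set and should be replaced with the set listed on the page above for your language; the function names are hypothetical.

```python
import string

# ASSUMPTION: placeholder for the English character set. Verify against the
# Transcribe character sets page and adjust for the language you use.
ASSUMED_ALLOWED = set(string.ascii_letters + "-'.")


def invalid_characters(term, allowed=ASSUMED_ALLOWED):
    """Return the sorted characters in `term` that are not in `allowed`."""
    return sorted(set(term) - allowed)


def report_invalid_terms(terms, allowed=ASSUMED_ALLOWED):
    """Map each offending term to the characters that are not allowed."""
    report = {}
    for term in terms:
        bad = invalid_characters(term, allowed)
        if bad:
            report[term] = bad
    return report


if __name__ == "__main__":
    print(report_invalid_terms(["Cloud", "E-M-R", "S3"]))
    # prints {'S3': ['3']}
```

A check like this only looks at which characters appear, not where they appear, so it would not flag a term such as -Hawn; positional rules need a separate check.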

AWS
answered 2 years ago
