Hi! It looks like incorrectly formatted words are being passed to Transcribe when the custom vocabulary is created. I checked the Lambda function in charge of creating the vocabulary, and for the example feed it takes in the following terms:
["-Hawn", "Cloud", "A-W-S-A-I-Services", "Amazon", "Code-Whisperer", "Pillir", "A-W-S", "S-A-P", "U-K-T-V", "Media-two-Cloud", "Marketplace", "Mainframe-Modernization", "E-M-R-Serverless", "E-M-R", "Apache", "Spark", "Hive", "Hawn", "Amazon-Connect", "Local-Measure", "low", "Intelligent-Automation"]
It seems that a term of the form "-word" is not accepted when creating a custom vocabulary (in this case, "-Hawn" is causing the issue), so the Lambda function in charge of the preprocessing should be reviewed: podcast-transcribe-index-createTranscribeVocabular***
Hope this helps!
Interesting that you found "-Hawn" to be the problem. According to the link provided by @iona Ekonomi, https://docs.aws.amazon.com/transcribe/latest/dg/charsets.html#char-english , hyphens are allowed in the English character set. Does this mean that hyphens at the beginning of a word are not acceptable? Also, could you let me know how you found out that this word was the one causing the issue?
It looks like a hyphen is not accepted at the start of a word; I will raise this internally. My troubleshooting went like this:
-> Modify the Lambda to print out the list of vocabulary phrases.
-> Once I had the list, try creating a custom vocabulary from it to see where it failed.
-> "-Hawn" was the only word that I could imagine making the vocabulary creation fail hehe, and so it was!
(Please accept the answer if it has helped you ^_^ )
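The first troubleshooting step above can be approximated offline. This is only a minimal sketch: `find_leading_hyphen_phrases` is a hypothetical helper, not part of the actual Lambda, and it checks only the one rule this thread identified (a hyphen at the start of a phrase), not everything Transcribe validates.

```python
import json


def find_leading_hyphen_phrases(phrases):
    """Return the phrases that start with a hyphen, in their original order."""
    return [p for p in phrases if p.startswith("-")]


# Phrase list copied from the example feed quoted earlier in this thread.
phrases = [
    "-Hawn", "Cloud", "A-W-S-A-I-Services", "Amazon", "Code-Whisperer",
    "Pillir", "A-W-S", "S-A-P", "U-K-T-V", "Media-two-Cloud", "Marketplace",
    "Mainframe-Modernization", "E-M-R-Serverless", "E-M-R", "Apache", "Spark",
    "Hive", "Hawn", "Amazon-Connect", "Local-Measure", "low",
    "Intelligent-Automation",
]

# Same idea as printing the list from inside the Lambda.
print(json.dumps(phrases))

# Flag the candidates worth testing first when creating the vocabulary.
print(find_leading_hyphen_phrases(phrases))  # ['-Hawn']
```

Phrases with a hyphen in the middle, such as "A-W-S", are not flagged, which matches the observation that only the leading hyphen was the problem.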
When you create a custom vocabulary for Transcribe, you have to check which characters are supported for the language you are using. You can find more details here: https://docs.aws.amazon.com/transcribe/latest/dg/charsets.html
For your specific language, that page shows how to deal with special characters.
With @Dani Mitchells help, I found out that the problem was caused by Transcribe not accepting words that start with "-" when creating its custom vocabulary. I solved it by adding an extra check for this case in the podcast-transcribe-index-createTranscribeVocabular function.
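The thread does not show the actual check that was added, so the following is only a sketch of what such a check might look like. `sanitize_vocabulary_phrases` is a hypothetical name, and both stripping the leading hyphen (rather than dropping the phrase) and removing the duplicates that result are assumptions on my part.

```python
def sanitize_vocabulary_phrases(phrases):
    """Strip leading hyphens and drop empty or duplicate phrases.

    Assumption: a phrase such as "-Hawn" should become "Hawn" rather than
    being discarded. Order of first appearance is preserved.
    """
    cleaned = []
    seen = set()
    for phrase in phrases:
        candidate = phrase.lstrip("-")  # remove any hyphens at the start
        if not candidate or candidate in seen:
            continue  # skip phrases that were only hyphens, or repeats
        seen.add(candidate)
        cleaned.append(candidate)
    return cleaned


print(sanitize_vocabulary_phrases(["-Hawn", "Cloud", "Hawn", "A-W-S"]))
# ['Hawn', 'Cloud', 'A-W-S']
```

In the example feed, stripping the hyphen from "-Hawn" produces a second "Hawn", which is why the sketch also deduplicates. Hyphens inside a phrase are left untouched.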