AWS Comprehend training data csv decoding error : Bad data: b'\x96'., exit code: 1

Error message: The file webform-training-list-utf-csv-note.csv could not be decoded as valid utf-8 at position 6357 to 6358. Bad data: b'\x96'., exit code: 1

There are special characters and line breaks in my training data. I have already removed "," and "|". Is there anything else I have to watch out for when preparing the data? Are there other characters to remove, or any other requirements?

asked 20 days ago, 58 views
2 Answers
Accepted Answer

Hi,

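b'\x96' is not a valid UTF-8 byte sequence, hence the error message: you specified that your file is UTF-8 encoded, but it contains at least one byte that is not.

b'\x96' is an en dash ('–') in Windows-1252 (often loosely called latin1), so the file was most likely saved in that encoding. You may want to tell Comprehend that your file is latin1 instead of UTF-8 or, if it only accepts UTF-8, re-encode the file as UTF-8.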

Best,

Didier

AWS EXPERT, answered 20 days ago
Reviewed by an EXPERT 20 days ago
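
The diagnosis in this answer can be checked with a few lines of Python. This is a minimal sketch, not part of the original answer: `decode_utf8_or_cp1252` is a hypothetical helper name, and the fallback to Windows-1252 is an assumption based on the single bad byte reported in the error message.

```python
def decode_utf8_or_cp1252(raw: bytes) -> str:
    """Decode as UTF-8; fall back to Windows-1252 if that fails.

    The cp1252 fallback is an assumption: it fits a file exported from
    Excel or another Windows tool, which is a common source of b'\\x96'.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" covers the few bytes cp1252 leaves undefined.
        return raw.decode("cp1252", errors="replace")


sample = b"a \x96 b"
text = decode_utf8_or_cp1252(sample)

# In Windows-1252 the byte 0x96 maps to U+2013 (en dash).
print(repr(text))
# Re-encoded as UTF-8, the en dash becomes the three bytes e2 80 93.
print(text.encode("utf-8"))
```

In strict ISO-8859-1 the same byte decodes to the control character U+0096 rather than a dash, which is why the sketch uses `cp1252` and not `latin-1`.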
  • The required format for AWS Comprehend is UTF-8 CSV. I tried removing all '-' characters, but I still get the same error message. I also tried saving the file as UTF-8, but that corrupts part of the file. Any other advice on how to deal with this?

    I'm analyzing comments left through the enquiry form. I'm trying to train a model and then run an asynchronous analysis of a larger dataset.

    • That dataset is another large CSV with possibly more non-UTF-8 data.
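
One way to approach the problem described in this comment is to convert the bytes rather than delete characters. Removing '-' has no effect because the offending byte is 0x96 (an en dash in Windows-1252), not the ASCII hyphen 0x2D. The sketch below is an illustration under assumptions: the function names are hypothetical, and the source encoding is assumed to be Windows-1252. It lists the offsets of invalid bytes and then writes a UTF-8 copy of the file.

```python
from pathlib import Path


def find_invalid_utf8(raw: bytes, limit: int = 10) -> list[tuple[int, bytes]]:
    """Return up to `limit` (offset, byte) pairs where UTF-8 decoding fails."""
    found = []
    pos = 0
    while pos < len(raw) and len(found) < limit:
        try:
            raw[pos:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as exc:
            start = pos + exc.start  # exc.start is relative to the slice
            found.append((start, raw[start:start + 1]))
            pos = start + 1
    return found


def reencode_to_utf8(src: Path, dst: Path, source_encoding: str = "cp1252") -> None:
    """Write a UTF-8 copy of `src`, assuming it is in `source_encoding`."""
    raw = src.read_bytes()
    try:
        text = raw.decode("utf-8")  # already valid: copy unchanged
    except UnicodeDecodeError:
        text = raw.decode(source_encoding, errors="replace")
    dst.write_bytes(text.encode("utf-8"))


print(find_invalid_utf8(b"abc\x96def"))
```

The same conversion can be applied to the larger analysis file before uploading it. A file that mixes valid UTF-8 with Windows-1252 bytes would need more careful handling than this whole-file fallback.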

Thanks for the quick response, awesome! Are there any formatting guidelines for CSV that we can follow like removing these symbols?
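
On the formatting question, standard CSV quoting usually makes it unnecessary to strip commas by hand. The following is only a sketch under assumptions: `write_training_rows` is a hypothetical helper, the two-column label/text layout is assumed, and collapsing line breaks into spaces is a conservative choice, not a documented Comprehend requirement. Check the Comprehend documentation for the exact training format.

```python
import csv
import io


def write_training_rows(rows, out) -> None:
    """Write (label, text) pairs as CSV, one document per line."""
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for label, text in rows:
        # Collapse line breaks and repeated whitespace into single spaces.
        flat = " ".join(text.split())
        # Fields containing commas or quotes are quoted by the csv module.
        writer.writerow([label, flat])


buf = io.StringIO()
write_training_rows([("ENQUIRY", "Hello,\nworld \u2013 ok")], buf)
print(buf.getvalue())
```

When writing to disk, opening the file with `open(path, "w", encoding="utf-8", newline="")` keeps the output in UTF-8 regardless of the platform default.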

answered 20 days ago
