Is it possible to maintain the shape of a PDF using Textract, and to translate documents with Translate?


Hi, good evening. Is there a way to maintain the visual structure or layout of a PDF file (whether it is text-only or contains tables) using only the OCR function of the Textract service? I need to translate large quantities of documents that are not always printed or digitized well. In my tests the text extraction was very precise, and using Translate would speed up the work a lot. So I'd like to ask whether there is a way to keep the PDF layout intact, or whether I can restore it afterwards with some other function.

Second question: is it possible to translate documents in PDF or Word format with the Translate service?

Thanks in advance for your reply. Btw, happy new year :)

  • Hey, may I know if you finally figured it out? I have a similar requirement. Thanks.

asked a year ago · 903 views
1 Answer

Hi. If you want to extract the structure of the document, the best way is to use the AnalyzeDocument API, which extracts relationships and structural elements such as tables and key-value pairs. If you only want to use the DetectDocumentText API, you get the bounding box coordinates for each WORD or LINE detected, which you can use to reconstruct the document by placing the text in its original position (the information is in Geometry; see https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html). With this approach you get only the text, with no information about table structure or any other element that was in the original document.
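To make the bounding-box idea more concrete, here is a minimal Python sketch of how the LINE blocks could be turned into page positions. It is only an illustration under some assumptions: the function name `lines_with_positions`, the page size arguments, and the hand-written `sample_response` are hypothetical, and the sample only mimics the `Blocks` / `Geometry` / `BoundingBox` layout of a Textract response, where `Left` and `Top` are ratios of the page width and height.

```python
from typing import Dict, List, Tuple


def lines_with_positions(
    response: Dict, page_width: float, page_height: float
) -> List[Tuple[str, float, float]]:
    """Return (text, x, y) for every LINE block, scaled to page units.

    x and y are the top-left corner of the line's bounding box.
    The result is sorted top-to-bottom, then left-to-right.
    """
    placed = []
    for block in response.get("Blocks", []):
        # Skip PAGE and WORD blocks; only whole lines are placed here.
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        x = box["Left"] * page_width   # ratio of page width -> page units
        y = box["Top"] * page_height   # ratio of page height -> page units
        placed.append((block["Text"], x, y))
    placed.sort(key=lambda item: (item[2], item[1]))
    return placed


# Hand-written sample that imitates the shape of a Textract response
# (hypothetical data, not real service output).
sample_response = {
    "Blocks": [
        {
            "BlockType": "LINE",
            "Text": "World",
            "Geometry": {"BoundingBox": {"Left": 0.125, "Top": 0.5, "Width": 0.25, "Height": 0.03125}},
        },
        {
            "BlockType": "WORD",
            "Text": "Hello",
            "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.125, "Width": 0.125, "Height": 0.03125}},
        },
        {
            "BlockType": "LINE",
            "Text": "Hello",
            "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.125, "Width": 0.125, "Height": 0.03125}},
        },
    ]
}

for text, x, y in lines_with_positions(sample_response, page_width=1000, page_height=2000):
    print(f"{text!r} at x={x}, y={y}")
```

In real use the response would come from the service, for example through boto3's `detect_document_text`, and each line's text could be translated (for example with Amazon Translate's `translate_text`) before being drawn at the same position. Note that `Top` is measured from the top of the page, so a PDF library that measures from the bottom would need the y value flipped.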

Regarding your second question, Textract doesn't do document conversion: we extract text and structure information from the document, but we do not recreate a document similar to the one you sent.

I hope it helps. Happy New Year to you as well :)

AWS
answered a year ago
  • Not being a developer, I find this a bit complicated. May I ask where the JSON code has to go? I thought there was a link where you upload the PDF file to get the OCR. Thanks for the answer, though :)
