Is it possible to maintain the shape of a PDF using Textract, and translate docs with Translate?

Question

Hi, good evening. Is there a way to maintain the visual structure or shape of a PDF file (whether it is text-only or contains tables) using only the OCR function of the Textract service? I need to translate large quantities of documents that are not always printed or digitized well. In my tests the text extraction was very precise, and using Translate would speed up the work a lot. So is there a way to keep the PDF layout reasonably intact, or can I restore it in a later step with some other functions?

Second question: is it possible to translate documents in PDF or Word format with the Translate service?

Thanks in advance for your reply. Btw, happy new year :)

Answer

Hi,
If you want to extract the structure of the document, the best way is to use the AnalyzeDocument API. It extracts the relationships and structural elements, such as tables and key-value pairs.
However, if you only want to use the DetectDocumentText API, you get the bounding box coordinates for each WORD or LINE detected, which you can use to reconstruct the document by placing the text in its original position (the information is in `Geometry`; see https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html). With this you get only the text, with no information about table structure or anything else that was in the original document.
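As a rough illustration of the `Geometry` approach, here is a minimal Python sketch. It assumes boto3 is installed and configured, and that the response has the usual `Blocks` list where each LINE block carries `Geometry.BoundingBox` values expressed as ratios of the page size. The helper names `detect_lines_from_file` and `place_lines`, and the page dimensions, are made up for this example. Multi-page PDFs would go through the asynchronous API instead, which this sketch does not cover.

```python
def detect_lines_from_file(path):
    """Call Textract on one page image (hypothetical helper, needs AWS credentials)."""
    import boto3  # imported lazily so place_lines() can be used without boto3

    client = boto3.client("textract")
    with open(path, "rb") as f:
        return client.detect_document_text(Document={"Bytes": f.read()})


def place_lines(response, page_width, page_height):
    """Turn LINE blocks into absolute positions on a page of the given size.

    BoundingBox values are ratios (0..1) measured from the top-left corner,
    so they are scaled by the target page dimensions.
    """
    placed = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        placed.append({
            "text": block["Text"],
            "x": box["Left"] * page_width,
            "y": box["Top"] * page_height,
            "width": box["Width"] * page_width,
            "height": box["Height"] * page_height,
        })
    # Reading order: top to bottom, then left to right.
    placed.sort(key=lambda item: (item["y"], item["x"]))
    return placed
```

Each returned entry can then be drawn at `(x, y)` with whatever PDF or image library you prefer. As noted above, this restores positions only, not tables or other structure.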

Regarding your second question, Textract doesn't do document conversion: we extract text and structure information from the document, but we do not recreate a document similar to the one that you sent.
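Since the rebuilt document has to be produced on your side, one possible workflow is to translate each extracted line and carry its position along unchanged. The sketch below is only an illustration under assumptions: each line is a dict with a `text` key plus whatever position fields you stored, and the helpers `translate_lines` and `make_aws_translator` are hypothetical names. The only AWS call used is the Translate client's `translate_text` in boto3.

```python
def make_aws_translator(source="auto", target="en"):
    """Build a text -> text function backed by Amazon Translate (needs AWS credentials)."""
    import boto3  # imported lazily so translate_lines() works without boto3

    client = boto3.client("translate")

    def _translate(text):
        result = client.translate_text(
            Text=text,
            SourceLanguageCode=source,
            TargetLanguageCode=target,
        )
        return result["TranslatedText"]

    return _translate


def translate_lines(lines, translate_fn):
    """Return copies of the line dicts with 'text' translated and positions untouched."""
    return [{**line, "text": translate_fn(line["text"])} for line in lines]
```

For example, `translate_lines(lines, make_aws_translator(target="en"))` would give you translated text at the original coordinates. Translated text is often longer or shorter than the source, so the drawing step may need to adjust font size to fit each box.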

I hope it helps.
Happy New Year to you as well :)
