Is it possible to maintain the shape of a PDF using Textract, and to translate documents with Translate?

Hi, good evening. I would like to ask whether there is a way to preserve the visual structure or layout of a PDF file (whether it is text-only or contains tables) using only the OCR function of the Textract service. I need to translate large quantities of documents that are not always printed well or that were digitized poorly afterwards. In my tests the text extraction was very precise, and with Translate I would be able to speed up the work a lot. So I'd like to ask whether there is a way to keep the PDF layout reasonably intact during extraction, or whether I can restore it in a second step with some other functions.

Second question: is it possible to translate documents in PDF or Word format with the Translate service?

Thanks in advance for your reply. Btw, happy new year :)

  • Hey, may I ask whether you finally figured it out? I have a similar requirement. Thanks.

Asked 1 year ago · 946 views
1 Answer

Hi, if you want to extract the structure of the document, the best way is to use the AnalyzeDocument API. It extracts the relationships and structural elements of the document, such as tables and key-value pairs. If you only want to use the text detection APIs (such as DetectDocumentText), you get the bounding box coordinates for each WORD or LINE detected, which you can use to reconstruct the document by placing the text in its original position (see https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html; the information is in Geometry). With this approach you get only the text and its position, with no information about table structure or any other structural elements that were in the original document.
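To make the bounding-box idea concrete, here is a minimal sketch in Python that places each LINE block on a fixed character grid. It assumes a response shaped like the Textract text detection output (`Blocks`, `BlockType`, `Text`, and `Geometry.BoundingBox` with `Left` and `Top` given as fractions of the page). The function name `render_lines`, the grid size, and the sample data are hypothetical and only for illustration; a real response would come from a call such as boto3's `detect_document_text`.

```python
from typing import Any


def render_lines(response: dict[str, Any], cols: int = 80, rows: int = 40) -> str:
    """Place each LINE block on a character grid using its bounding box.

    Left/Top are ratios of the page width/height (0.0 to 1.0), so they are
    scaled to grid columns and rows. Later lines overwrite earlier ones if
    they land on the same cells.
    """
    grid = [[" "] * cols for _ in range(rows)]
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        row = min(int(box["Top"] * rows), rows - 1)
        col = min(int(box["Left"] * cols), cols - 1)
        for offset, ch in enumerate(block["Text"]):
            if col + offset >= cols:
                break  # text that runs past the right edge is cut off
            grid[row][col + offset] = ch
    # Drop trailing spaces on each row and trailing empty rows.
    return "\n".join("".join(r).rstrip() for r in grid).rstrip("\n")


# Hypothetical sample in the shape of a Textract response (not real output).
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {
            "BlockType": "LINE",
            "Text": "Invoice",
            "Geometry": {"BoundingBox": {"Left": 0.5, "Top": 0.0, "Width": 0.2, "Height": 0.05}},
        },
        {
            "BlockType": "LINE",
            "Text": "Total: 42",
            "Geometry": {"BoundingBox": {"Left": 0.0, "Top": 0.5, "Width": 0.3, "Height": 0.05}},
        },
    ]
}

if __name__ == "__main__":
    print(render_lines(sample, cols=20, rows=4))
```

A character grid is only a rough approximation of the page. For a closer reproduction, the same coordinates could be passed to a PDF or image drawing library, and each line's `Text` could be translated before it is placed.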

Regarding your second question: Textract does not do document conversion. We extract text and structural information from the document, but we do not recreate a document similar to the one you sent.

I hope it helps. Happy New Year to you as well :)

AWS
Answered 1 year ago
  • Not being a developer, I find this a bit complicated. May I ask where the JSON code has to go? I thought there was a link where you upload the PDF file to get the OCR output. Thanks for the answer, though :)
