Is it possible to preserve the layout of a PDF using Textract, and to translate documents with Translate?

Hi, good evening. Is there a way to preserve the visual structure or layout of a PDF file (whether it is text-only or contains tables) using only the OCR function of the Textract service? I need to translate large quantities of documents that are not always well printed or well digitized. I ran some tests and the text extraction is very precise, so running the extracted text through Translate would speed up the work a lot. Is there a way to keep the PDF layout reasonably intact, or can I restore it in a second step with other functions?

Second question: is it possible to translate documents in PDF or Word format with the Translate service?

Thanks in advance for your reply. Btw, happy new year :)

  • Hey, may I ask whether you figured this out in the end? I have a similar requirement. Thanks.

Asked a year ago · 948 views
1 answer

Hi. If you want to extract the structure of the document, the best way is to use the AnalyzeDocument API, which extracts structural elements and the relationships between them, such as tables and key-value pairs. If you only want to use the text detection API (DetectDocumentText), you get the bounding box coordinates for each WORD or LINE detected, which you can use to reconstruct the document by placing the text in its original position (the coordinates are in the Geometry field; see https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html). With this approach you get only the text, with no information about table structure or any other structure that was in the original document.
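As a rough illustration of the bounding-box approach, here is a minimal Python sketch. It assumes a Textract text detection response in which each LINE block carries `Geometry.BoundingBox` values (`Left`, `Top`, `Width`, `Height`) expressed as ratios of the page size. The helper names `place_lines` and `detect_lines` are hypothetical, the page dimensions are whatever your output format uses, and the synchronous boto3 call assumes a single-page document and configured AWS credentials.

```python
from typing import Dict, List, Tuple


def place_lines(response: Dict, page_width: float,
                page_height: float) -> List[Tuple[float, float, str]]:
    """Convert LINE blocks into (x, y, text) tuples in page units.

    Textract reports bounding boxes as ratios of the page width and
    height, so each ratio is scaled by the target page size.
    """
    placed = []
    for block in response.get("Blocks", []):
        # Keep only whole lines; PAGE and WORD blocks are skipped.
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        x = box["Left"] * page_width   # distance from the left edge
        y = box["Top"] * page_height   # distance from the top edge
        placed.append((x, y, block["Text"]))
    # Sort top-to-bottom, then left-to-right, for a simple reading order.
    placed.sort(key=lambda item: (item[1], item[0]))
    return placed


def detect_lines(path: str) -> Dict:
    """Hypothetical helper: send one file to Textract text detection."""
    import boto3  # imported lazily so place_lines works without AWS access

    with open(path, "rb") as handle:
        document_bytes = handle.read()
    client = boto3.client("textract")
    return client.detect_document_text(Document={"Bytes": document_bytes})
```

Something like `place_lines(detect_lines("scan.png"), 600, 800)` would then give the coordinates at which to draw each line (or its translation) with a PDF or image library of your choice. As noted above, this recovers text positions only, not tables or other structure.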

Regarding your second question, Textract doesn't do document conversion: we extract text and structure information from the document, but we do not recreate a document similar to the one you sent.

I hope it helps. Happy New Year to you as well :)

AWS
Answered a year ago
  • I'm not a developer, so this is a bit complicated for me. May I ask where the JSON code needs to go? I thought there was a page where you upload the PDF file to get the OCR output. Thanks for the answer, though :)
