Textract: Layout response objects are misidentified and not returned in the desired order

Background: I am using the Textract AnalyzeDocument API to detect Layout response objects in a PDF page. The page has page headers, a title, sub-headers, tables, figures, and text. It is divided into 3 vertical columns, each containing some text and tables.

Challenge: I have 2 challenges:

  1. With the Layout option of the AnalyzeDocument API, Textract correctly identifies about 90% of the response objects. Some sub-headers are identified as Text, and sometimes sub-headers are identified as part of a Table. How can I train my model to identify the response objects correctly?
  2. The order in which these Layout response objects are returned is completely wrong. For example, I want all the response objects of Column 1 to be presented first, followed by those of Column 2, and so on. Is there a way to train Textract to first identify and print the objects from Column 1, followed by those from Column 2?

I am attaching some snippets to better understand my challenges:

(Two screenshots were attached to the original post.)

Asked a month ago · 144 views
1 Answer
Accepted Answer

Using bounding boxes might be helpful. You should try the Textractor Package (amazon-textract-overlayer)
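
As a rough illustration of the bounding-box idea, the sketch below pulls the Layout blocks and their boxes out of an AnalyzeDocument response. It assumes the response is the dictionary returned by boto3's `analyze_document` called with `FeatureTypes=["LAYOUT"]`; the helper name `layout_boxes` and the flat dictionary shape it returns are made up for this example and are not part of Textract or Textractor.

```python
def layout_boxes(response):
    """Collect the bounding box of every LAYOUT_* block in a Textract response.

    `response` is assumed to be the dict returned by AnalyzeDocument with the
    LAYOUT feature enabled. Coordinates are ratios of the page size (0 to 1).
    """
    boxes = []
    for block in response.get("Blocks", []):
        block_type = block.get("BlockType", "")
        if not block_type.startswith("LAYOUT_"):
            continue  # skip PAGE, LINE, WORD, TABLE, CELL, etc.
        bbox = block["Geometry"]["BoundingBox"]
        boxes.append({
            "id": block.get("Id"),
            "type": block_type,        # e.g. LAYOUT_TEXT, LAYOUT_TABLE
            "left": bbox["Left"],      # x-min
            "top": bbox["Top"],        # y-min
            "width": bbox["Width"],
            "height": bbox["Height"],
        })
    return boxes
```

The `left` and `top` values are the x-min and y-min of each object, so they can be used directly for any custom ordering.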

AWS
JoeWil
Answered a month ago
Expert
Reviewed a month ago
  • Thanks for your answer. Yes, I have been trying that: using the bounding boxes to get the x-min and y-min of each response object and then trying to devise a way to order them. The challenge is that, even with the x-min coordinate, I am not able to tell which response objects fall in Column 1, Column 2, or Column 3 of the page. In the output, I have to list all the objects of Column 1 first, in order of increasing y-min, followed by those of Column 2, and so on. Is there any approach or algorithm you can think of to help me achieve this?
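
One possible approach, offered as a sketch rather than a tested solution: sort the objects by x-min, start a new column whenever the x-min jumps by more than some gap, and then sort each column by y-min. The function name `order_by_columns` and the `gap` and `full_width` thresholds are assumptions chosen for illustration and would need tuning for a real page. Objects wider than `full_width` (such as a page header or title spanning all columns) are simply listed first, which is a simplification and would misplace a full-width footer.

```python
def order_by_columns(boxes, gap=0.05, full_width=0.6):
    """Order layout boxes column by column, top to bottom within each column.

    Each box is a dict with "left", "top" and "width" as page ratios (0 to 1).
    `gap` and `full_width` are illustrative thresholds, not Textract settings.
    """
    # Objects spanning most of the page width do not belong to one column.
    spanning = [b for b in boxes if b["width"] >= full_width]
    narrow = sorted((b for b in boxes if b["width"] < full_width),
                    key=lambda b: b["left"])

    # Walk the boxes left to right; a large jump in x-min starts a new column.
    columns = []
    for box in narrow:
        if columns and box["left"] - columns[-1][-1]["left"] <= gap:
            columns[-1].append(box)
        else:
            columns.append([box])

    ordered = sorted(spanning, key=lambda b: b["top"])
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b["top"]))
    return ordered
```

With three columns whose left edges sit near 0.05, 0.37, and 0.69 of the page width, a `gap` of 0.05 keeps slightly indented objects in their own column while still separating the columns from each other.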
