Textract: Layout response objects are misclassified and not returned in the desired order

Background: I am using the Textract AnalyzeDocument API to detect Layout response objects on a PDF page. The page contains page headers, a title, sub-headers, tables, figures, and text. It is divided into 3 vertical columns, each containing some text and tables.

Challenge: I have 2 challenges:

  1. With the Layout option of the AnalyzeDocument API, Textract correctly identifies about 90% of the response objects. Some sub-headers are identified as Text, and some are identified as part of a Table. How can I train the model to identify the response objects correctly?
  2. The order in which these Layout response objects are returned is wrong. For example, I want all the response objects of column 1 to come first, followed by those of column 2, and so on. Is there a way to train Textract to identify and print the objects from column 1 first, followed by those from column 2?

I am attaching some snippets to better understand my challenges:

[image] [image]

Asked 1 month ago, viewed 144 times
1 Answer
Accepted Answer

Using bounding boxes might be helpful. You should try the Textractor Package (amazon-textract-overlayer)
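
As a starting point for working with bounding boxes, here is a minimal sketch that pulls the Layout blocks and their coordinates out of an AnalyzeDocument response that has already been loaded as a Python dict. The `Blocks`, `BlockType`, `Geometry`, and `BoundingBox` (`Left`, `Top`, `Width`, `Height`) fields come from the Textract response format; the function name `layout_boxes` and the output keys (`x_min`, `y_min`, and so on) are made up for this illustration.

```python
from typing import Any


def layout_boxes(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect LAYOUT_* blocks with their normalized bounding boxes.

    Coordinates are ratios of the page width/height (0..1), as Textract
    reports them. The output dict shape is an assumption of this sketch.
    """
    boxes = []
    for block in response.get("Blocks", []):
        block_type = block.get("BlockType", "")
        # Layout elements have types such as LAYOUT_TITLE or LAYOUT_TABLE.
        if not block_type.startswith("LAYOUT_"):
            continue
        bbox = block["Geometry"]["BoundingBox"]
        boxes.append({
            "id": block["Id"],
            "type": block_type,
            "page": block.get("Page", 1),
            "x_min": bbox["Left"],
            "y_min": bbox["Top"],
            "x_max": bbox["Left"] + bbox["Width"],
            "y_max": bbox["Top"] + bbox["Height"],
        })
    return boxes
```

Flattening the geometry into `x_min`/`y_min`/`x_max`/`y_max` makes later sorting and grouping steps easier to write than working with `Left`/`Top`/`Width`/`Height` directly.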

AWS
JoeWil
answered 1 month ago
EXPERT
reviewed 1 month ago
  • Thanks for your answer. Yes, I have been trying that: using bounding boxes to get the x-min and y-min of each response object and then trying to devise a way to order them. The challenge is that, even with the x-min coordinate, I cannot tell which response objects fall in column 1, column 2, or column 3 of the page. In the output, I have to list all the objects of column 1 first, ordered by increasing y-min, followed by those of column 2, and so on. Is there any approach or algorithm you can think of to help me achieve this?
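
One possible approach to the ordering described in this comment is to group boxes into columns by looking for jumps in x-min, then sort each column by y-min. The sketch below is a heuristic, not a Textract feature. It assumes each box is a dict with `x_min`, `x_max`, and `y_min` in normalized page coordinates (0 to 1), and the thresholds `gap` and `wide` are illustrative values that would need tuning for a real page. Blocks wider than `wide` (a page header or title, for example) are treated as spanning several columns and are simply emitted first, which is a simplification.

```python
def order_by_column(boxes, gap=0.1, wide=0.5):
    """Return boxes column by column, top to bottom within each column.

    `gap` and `wide` are hypothetical tuning thresholds in normalized
    page units; they are not Textract parameters.
    """
    def width(b):
        return b["x_max"] - b["x_min"]

    # Blocks spanning several columns: emit first, top to bottom.
    spanning = sorted((b for b in boxes if width(b) > wide),
                      key=lambda b: b["y_min"])
    # Remaining blocks, left to right.
    narrow = sorted((b for b in boxes if width(b) <= wide),
                    key=lambda b: b["x_min"])

    columns = []
    for box in narrow:
        # Same column if x_min is within `gap` of the column's left edge
        # (the first, left-most box of that column); otherwise start a new one.
        if columns and box["x_min"] - columns[-1][0]["x_min"] <= gap:
            columns[-1].append(box)
        else:
            columns.append([box])

    ordered = list(spanning)
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b["y_min"]))
    return ordered


# Made-up boxes for a three-column page with a full-width title.
sample = [
    {"id": "a", "x_min": 0.05, "x_max": 0.33, "y_min": 0.30},
    {"id": "d", "x_min": 0.69, "x_max": 0.95, "y_min": 0.50},
    {"id": "c", "x_min": 0.37, "x_max": 0.64, "y_min": 0.12},
    {"id": "title", "x_min": 0.05, "x_max": 0.95, "y_min": 0.02},
    {"id": "b", "x_min": 0.06, "x_max": 0.33, "y_min": 0.10},
    {"id": "e", "x_min": 0.70, "x_max": 0.95, "y_min": 0.11},
    {"id": "f", "x_min": 0.38, "x_max": 0.64, "y_min": 0.40},
]
print([b["id"] for b in order_by_column(sample)])
# ['title', 'b', 'a', 'c', 'f', 'e', 'd']
```

Comparing each x-min against the left edge of the current column, rather than against the previous box, keeps slightly indented blocks in the same column as long as the indent stays below `gap`. A footer or a wide table in the middle of the page would also be classified as spanning here, so a real implementation may need to handle those cases separately.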
