Textract: Layout response objects are misidentified and not returned in the desired order

Background: I am using the Textract AnalyzeDocument API to detect Layout response objects on a PDF page. The page has page headers, a title, sub-headers, tables, figures, and text. It is divided into 3 vertical columns, each containing some text and tables.

I have 2 challenges:

  1. With the Layout option of the AnalyzeDocument API, Textract correctly identifies about 90% of the response objects. Some sub-headers are identified as Text, and others are identified as part of a Table. How can I train the model to identify the response objects correctly?
  2. The order in which the Layout response objects are returned is completely wrong. For example, I want all the response objects of Column 1 first, followed by those of Column 2, and so on. Is there a way to train Textract to identify and print the objects from Column 1 first, followed by those from Column 2?

I am attaching some snippets to illustrate my challenges:

[two images]

Asked 1 month ago · 144 views
1 Answer
Accepted Answer

Using bounding boxes might be helpful. You should try the Textractor package (amazon-textract-overlayer).
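As a rough illustration of the bounding-box idea, the sketch below pulls the Layout blocks and their boxes out of an AnalyzeDocument response using plain dictionaries. The `sample_response` is a tiny hand-made stand-in rather than real Textract output, and `layout_boxes` is a hypothetical helper name, not part of any AWS package.

```python
def layout_boxes(response):
    """Return (block_type, left, top, width, height) for every LAYOUT_* block.

    Coordinates are the ratios of page width/height that Textract reports
    in Geometry.BoundingBox.
    """
    boxes = []
    for block in response.get("Blocks", []):
        block_type = block.get("BlockType", "")
        if not block_type.startswith("LAYOUT_"):
            continue  # skip PAGE, LINE, WORD, TABLE, ... blocks
        box = block["Geometry"]["BoundingBox"]
        boxes.append(
            (block_type, box["Left"], box["Top"], box["Width"], box["Height"])
        )
    return boxes


# Hand-made stand-in for an AnalyzeDocument response (assumption, not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE",
         "Geometry": {"BoundingBox": {"Left": 0.0, "Top": 0.0, "Width": 1.0, "Height": 1.0}}},
        {"BlockType": "LAYOUT_TITLE",
         "Geometry": {"BoundingBox": {"Left": 0.05, "Top": 0.02, "Width": 0.9, "Height": 0.05}}},
        {"BlockType": "LAYOUT_TEXT",
         "Geometry": {"BoundingBox": {"Left": 0.05, "Top": 0.1, "Width": 0.28, "Height": 0.2}}},
    ]
}

boxes = layout_boxes(sample_response)
```

Each tuple gives the top-left corner and size of one Layout object, which is the raw material for any custom ordering.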

AWS
JoeWil
Answered 1 month ago
Expert
Reviewed 1 month ago
  • Thanks for your answer. Yes, I have been trying that: using the bounding boxes to get the x-min and y-min of each response object and then trying to devise a way to order them. The challenge is that, even with the x-min coordinate, I cannot tell which response objects fall in Column 1, Column 2, or Column 3 of the page. In the output, I need all the objects of Column 1 first, ordered by increasing y-min, followed by those of Column 2, and so on. Is there any approach or algorithm you can think of to help me achieve this?
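One possible heuristic for the ordering described above, offered as a sketch rather than a tested solution: sort the x-min values, treat the widest gaps between consecutive values as column boundaries, then sort by (column, y-min). The function name `order_by_column`, the `full_width` threshold, and the simplified block dicts are all assumptions made for illustration; the `left`, `top`, and `width` values stand for the ratios in each block's `Geometry.BoundingBox`. Blocks wider than `full_width` (such as a page header or title) are assumed to sit at the top of the page and are emitted first.

```python
from bisect import bisect_right


def order_by_column(blocks, n_columns=3, full_width=0.6):
    """Order blocks column by column, top to bottom within each column.

    blocks: list of dicts with "left", "top", "width" as page ratios (0 to 1).
    """
    # Blocks that span most of the page do not belong to a single column.
    spanning = [b for b in blocks if b["width"] >= full_width]
    narrow = [b for b in blocks if b["width"] < full_width]

    lefts = sorted(b["left"] for b in narrow)
    # The (n_columns - 1) widest gaps between consecutive x-min values
    # are taken as the boundaries between columns.
    gaps = sorted(
        ((lefts[i + 1] - lefts[i], i) for i in range(len(lefts) - 1)),
        reverse=True,
    )[: n_columns - 1]
    cuts = sorted((lefts[i] + lefts[i + 1]) / 2 for _, i in gaps)

    def sort_key(block):
        column = bisect_right(cuts, block["left"])  # 0-based column index
        return (column, block["top"])

    return sorted(spanning, key=lambda b: b["top"]) + sorted(narrow, key=sort_key)


# Made-up coordinates for a three-column page (illustrative only).
blocks = [
    {"id": "title", "left": 0.05, "top": 0.02, "width": 0.90},
    {"id": "c2-a", "left": 0.37, "top": 0.10, "width": 0.28},
    {"id": "c1-b", "left": 0.06, "top": 0.40, "width": 0.27},
    {"id": "c3-a", "left": 0.69, "top": 0.12, "width": 0.28},
    {"id": "c1-a", "left": 0.05, "top": 0.10, "width": 0.28},
    {"id": "c2-b", "left": 0.38, "top": 0.55, "width": 0.25},
]
ordered_ids = [b["id"] for b in order_by_column(blocks)]
# ordered_ids == ["title", "c1-a", "c1-b", "c2-a", "c2-b", "c3-a"]
```

The gap heuristic only works when objects in the same column start at roughly the same x-min. Heavily indented text, or a column with no objects on the page, would likely need a different rule, for example fixed column boundaries measured from the page template.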
