Dealing with large-dimension, small-data PDFs in Textract


I am getting an INVALID_DOCUMENT_TYPE error when trying to process a PDF with Textract, even though the PDF is only 1 MB. However, the PDF is about 105" x 35", which I know exceeds the allowed quota limit. I have two primary questions:

  • Is there a way to get more expressive errors from Textract? It took me quite a while to trace the failure to the size issue, because there seems to be only one overarching exception, UnsupportedDocumentException, for this type of error, while there are many possible document quota issues.
  • Are there best practices for splitting up large PDFs within the Textract system? The file has a large amount of white space, which is why its dimensions are so large relative to its data size.
Zac Dan
Asked 1 year ago, viewed 255 times
1 answer
Accepted answer
  1. I understand that you would like to know whether you can get more logs from Textract. Unfortunately, Textract logging is limited, and what you are currently seeing is all the logging that is currently supported. You can find more information by checking the CloudTrail API calls, either manually in the CloudTrail console or by setting up CloudWatch logging to view your CloudTrail logs [1]. That error usually happens when the document does not meet the criteria listed here [2], or when the document is corrupted or encoded incorrectly.

  2. For PDFs with more than 3,000 pages, I recommend splitting the PDF into batches so that each batch falls within the acceptable page range. I have also provided an external link to PDF splitter code that you can implement [3]. As additional information, for images above 10 MB I recommend decreasing the resolution of the images until they fall under the 10 MB mark. I can recommend OpenCV for this.
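
The batching suggested in item 2 could look roughly like the sketch below. It is not the code from the linked splitter [3]; `page_batches` and `split_pdf` are hypothetical names chosen for illustration, the 3,000-page default is simply the figure quoted in the answer, and `split_pdf` assumes the third-party pypdf package is installed.

```python
from typing import Iterator, List, Tuple


def page_batches(total_pages: int, max_pages: int = 3000) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) page-index ranges of at most max_pages pages."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    for start in range(0, total_pages, max_pages):
        yield start, min(start + max_pages, total_pages)


def split_pdf(src_path: str, out_prefix: str, max_pages: int = 3000) -> List[str]:
    """Write each batch of pages to its own PDF and return the output paths."""
    # Assumption: pypdf is available; it is imported lazily so that
    # page_batches can be used without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    outputs = []
    for start, end in page_batches(len(reader.pages), max_pages):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # File names use 1-based page numbers, e.g. "doc_1-3000.pdf".
        out_path = f"{out_prefix}_{start + 1}-{end}.pdf"
        with open(out_path, "wb") as handle:
            writer.write(handle)
        outputs.append(out_path)
    return outputs
```

Batching by page count only addresses the page-count criterion; it leaves the dimensions of each page unchanged.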

Resources: [1] https://docs.aws.amazon.com/textract/latest/dg/logging-using-cloudtrail.html

[2] https://docs.aws.amazon.com/textract/latest/dg/API_Document.html

[3] https://github.com/x4nth055/pythoncode-tutorials/tree/master/handling-pdf-files/split-pdf

AWS
Answered 1 year ago
Expert
Reviewed 2 months ago
  • Thanks for the note! Specifically, I wanted to see whether there was a way to get more granularity on which of the criteria a document failed when it is rejected. For my documents, I found that the problem was their overall physical size, not the data size, but the error did not specify that. I appreciate your response.
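
Since the service error does not say which criterion failed, one workaround is a local pre-flight check before calling Textract. The sketch below is only an illustration under stated assumptions: `Limits`, `find_violations`, and `read_page_sizes` are hypothetical names, the 3,000-page and 10 MB defaults are the figures quoted in the answer, and the 40-inch side limit is an assumed placeholder that should be checked against the current Textract quotas. `read_page_sizes` assumes the third-party pypdf package is installed.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

POINTS_PER_INCH = 72.0  # default PDF user-space units per inch


@dataclass(frozen=True)
class Limits:
    # Placeholder values: 10 MB and 3000 pages come from the answer above;
    # 40 inches is an assumption. Verify all of them against current quotas.
    max_file_bytes: int = 10 * 1024 * 1024
    max_pages: int = 3000
    max_side_inches: float = 40.0


def find_violations(
    file_bytes: int,
    page_sizes_pt: Sequence[Tuple[float, float]],
    limits: Limits = Limits(),
) -> List[str]:
    """Return one human-readable message per criterion the document fails."""
    problems = []
    if file_bytes > limits.max_file_bytes:
        problems.append(
            f"file size {file_bytes} bytes exceeds {limits.max_file_bytes} bytes"
        )
    if len(page_sizes_pt) > limits.max_pages:
        problems.append(
            f"page count {len(page_sizes_pt)} exceeds {limits.max_pages}"
        )
    for number, (width_pt, height_pt) in enumerate(page_sizes_pt, start=1):
        width_in = width_pt / POINTS_PER_INCH
        height_in = height_pt / POINTS_PER_INCH
        if max(width_in, height_in) > limits.max_side_inches:
            problems.append(
                f"page {number}: {width_in:.1f} x {height_in:.1f} in exceeds "
                f"{limits.max_side_inches:g} in per side"
            )
    return problems


def read_page_sizes(path: str) -> List[Tuple[float, float]]:
    """Read each page's media box size in points (requires pypdf)."""
    from pypdf import PdfReader  # third-party, imported lazily

    return [
        (float(page.mediabox.width), float(page.mediabox.height))
        for page in PdfReader(path).pages
    ]
```

For a 1 MB file with a single 105" x 35" page (7560 x 2520 points), `find_violations(1_000_000, [(7560, 2520)])` reports one violation, for the dimensions of page 1, and none for file size or page count.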
