Dealing with large dimensioned, small-data PDFs in Textract

I am getting an INVALID_DOCUMENT_TYPE error when trying to process a PDF with Textract, even though the PDF is only 1 MB. However, the page is about 105" x 35", which I know is larger than the allowed quota. I have two primary questions:

  • Is there a way to get more expressive errors from Textract? It took me quite a while to trace this to the page size, because there seems to be only one overarching exception, UnsupportedDocumentException, for this class of error, while there are many possible document quota issues.
  • Are there best practices for splitting up large PDFs for Textract? The file has a large amount of white space, which is why its dimensions are so large relative to its file size.
Zac Dan
Asked a year ago, 255 views
1 answer
Accepted answer
  1. I understand that you would like to know whether you can get more logs from Textract. Unfortunately, Textract logging is limited, and what you are currently seeing is all the logging that is currently supported. You can see more information by checking the CloudTrail API calls, either manually in the CloudTrail console or by setting up logging with CloudWatch to view your CloudTrail logs [1]. That error usually occurs when the document does not meet the criteria listed here [2], or when the document is corrupted or encoded incorrectly.

  2. For PDFs with more than 3,000 pages, I recommend splitting the PDF into batches so that each one falls within the accepted page range. I have also provided an external link to PDF splitter code you can adapt [3]. Additionally, for images above 10 MB, I recommend decreasing the resolution until they are under the 10 MB limit. OpenCV is one tool that can do this.
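As a rough illustration of the batching idea in point 2, here is a minimal sketch. It assumes the third-party pypdf package is installed; the function names `page_batches` and `split_pdf`, the output file naming, and the 3,000-page default are illustrative choices, not anything Textract provides. Note that it only addresses page count: it does not shrink a page whose physical dimensions are too large.

```python
from pathlib import Path


def page_batches(total_pages: int, max_pages: int = 3000) -> list[tuple[int, int]]:
    """Return half-open (start, end) page index ranges of at most max_pages each."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages must be >= 1")
    return [
        (start, min(start + max_pages, total_pages))
        for start in range(0, total_pages, max_pages)
    ]


def split_pdf(src: str, out_dir: str, max_pages: int = 3000) -> list[Path]:
    """Write src out as several PDFs, each holding at most max_pages pages."""
    # Assumption: pypdf is available (pip install pypdf). It is imported here
    # so that page_batches() can be used without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    batches = page_batches(len(reader.pages), max_pages)
    for part, (start, end) in enumerate(batches, start=1):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: <original name>_part<N>.pdf
        target = out / f"{Path(src).stem}_part{part}.pdf"
        with target.open("wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Calling `split_pdf("big.pdf", "parts")` (with a hypothetical `big.pdf`) would write `parts/big_part1.pdf`, `parts/big_part2.pdf`, and so on, each of which can then be submitted to Textract separately.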

Resources:

[1] https://docs.aws.amazon.com/textract/latest/dg/logging-using-cloudtrail.html

[2] https://docs.aws.amazon.com/textract/latest/dg/API_Document.html

[3] https://github.com/x4nth055/pythoncode-tutorials/tree/master/handling-pdf-files/split-pdf

AWS
Answered a year ago
Reviewed by an expert 2 months ago
  • Thanks for the note! Specifically, I was hoping to find out whether there is a way to get more granularity on which of the criteria a document failed when it is rejected. For my documents, I found that the problem was their overall physical size, not their data size, but the error did not specify that. I appreciate your response.
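Since the service error does not say which criterion failed, one workaround is a local pre-flight check that names the failing criterion before the document is sent. The sketch below is only that: the limit defaults (500 MB, 3,000 pages, 2,880 points or 40 inches per side) are assumptions that should be verified against the current Textract quotas page, `check_document` and `page_sizes` are hypothetical helper names, and reading real page sizes assumes the third-party pypdf package.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def check_document(size_bytes, page_sizes_pt, *, max_bytes=500 * 1024 * 1024,
                   max_pages=3000, max_side_pt=2880):
    """Return a list of reasons the document may be rejected (empty if none).

    The default limits are assumptions; check them against the current quotas.
    page_sizes_pt is a list of (width, height) tuples in PDF points.
    """
    problems = []
    if size_bytes > max_bytes:
        problems.append(f"file is {size_bytes} bytes; limit is {max_bytes} bytes")
    if len(page_sizes_pt) > max_pages:
        problems.append(f"document has {len(page_sizes_pt)} pages; limit is {max_pages}")
    for number, (width, height) in enumerate(page_sizes_pt, start=1):
        if max(width, height) > max_side_pt:
            problems.append(
                f"page {number} is {width / POINTS_PER_INCH:.1f} x "
                f"{height / POINTS_PER_INCH:.1f} in; "
                f"limit is {max_side_pt / POINTS_PER_INCH:.0f} in per side"
            )
    return problems


def page_sizes(path):
    """Read (width, height) in points for every page, using pypdf."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader

    return [
        (float(page.mediabox.width), float(page.mediabox.height))
        for page in PdfReader(path).pages
    ]
```

For a 1 MB file with a single 105" x 35" page (7,560 x 2,520 points), `check_document(1_000_000, [(7560, 2520)])` reports the page dimensions as the only problem, which is the detail the service error left out.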
