Dealing with large dimensioned, small-data PDFs in Textract


I am getting an INVALID_DOCUMENT_TYPE error when trying to process a PDF with Textract, even though the PDF is only 1 MB. However, the PDF is about 105" x 35", which I know exceeds the allowed quota limit. I have two primary questions:

  • Is there a way to get more expressive errors from Textract? It took me quite a while to trace the problem to the page size, since there seems to be only one overarching exception, UnsupportedDocumentException, for these errors, while there are many possible document quota issues.
  • Are there best practices for splitting up large PDFs within the Textract system? The file has a large amount of white space, which is why its page dimensions are so large relative to its file size.
  • Zac
1 Answer
Accepted Answer
  1. I understand that you would like to know whether you can get more detailed logs from Textract. Unfortunately, Textract logging is limited, and what you are currently seeing is all the logging that is supported. You can find more information by checking the CloudTrail API calls, either manually in the CloudTrail console or by setting up logging with CloudWatch to view your CloudTrail logs [1]. That error usually occurs when the document does not meet the criteria listed here [2]. It can also occur when the document is corrupted or encoded incorrectly.

  2. For PDFs with more than 3000 pages, I recommend splitting the PDF into batches so that each batch falls within the accepted page range. I have also provided an external link to PDF splitter code you can implement [3]. Additionally, for images above 10 MB, I recommend decreasing the resolution of the images until they fall under the 10 MB limit. OpenCV can be used for this.
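
The page-batch splitting suggested in point 2 could be sketched roughly as follows. This is only an illustration, not the code from [3]: the names `page_batches` and `split_pdf` are hypothetical, the 3000-page figure is taken from the answer above, and the file-writing helper assumes the third-party `pypdf` package is installed. Note that this approach splits by page count only; it does not change the physical dimensions of any page.

```python
def page_batches(total_pages, max_pages=3000):
    """Return (start, stop) page index ranges, each covering at most max_pages pages."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [
        (start, min(start + max_pages, total_pages))
        for start in range(0, total_pages, max_pages)
    ]


def split_pdf(src_path, out_prefix, max_pages=3000):
    """Write one PDF per batch and return the output paths.

    Assumes the third-party pypdf package is available; it is imported
    here so that page_batches() can be used without it.
    """
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    out_paths = []
    batches = page_batches(len(reader.pages), max_pages)
    for index, (start, stop) in enumerate(batches, start=1):
        writer = PdfWriter()
        for page_number in range(start, stop):
            writer.add_page(reader.pages[page_number])
        out_path = f"{out_prefix}_{index:03d}.pdf"
        with open(out_path, "wb") as handle:
            writer.write(handle)
        out_paths.append(out_path)
    return out_paths
```

Each output file could then be submitted to Textract as a separate job.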

Resources:

[1] https://docs.aws.amazon.com/textract/latest/dg/logging-using-cloudtrail.html

[2] https://docs.aws.amazon.com/textract/latest/dg/API_Document.html

[3] https://github.com/x4nth055/pythoncode-tutorials/tree/master/handling-pdf-files/split-pdf

AWS
answered a year ago
EXPERT
reviewed 2 months ago
  • Thanks for the note! Specifically, I wanted to see whether there was a way to get more granularity on which of the criteria a document failed when it is rejected. For my documents, I found that the problem was their overall physical size, not the data size, but the error did not specify that. I appreciate your response.
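
Since the service error does not say which criterion failed, one possible workaround is a client-side pre-flight check that names the violated limit before the document is submitted. The sketch below is an assumption-laden illustration, not a Textract feature: `find_violations` is a hypothetical helper, and the limit values (40 inches per side, 3000 pages, 500 MB) are assumptions that should be verified against the current Textract quota documentation. Page dimensions in points would need to be read with a PDF library of your choice.

```python
POINTS_PER_INCH = 72

# Assumed limits; verify against the current Textract quotas before relying on them.
MAX_SIDE_PT = 40 * POINTS_PER_INCH   # assumed maximum page width/height (2880 pt)
MAX_PAGES = 3000                     # assumed maximum page count
MAX_BYTES = 500 * 1024 * 1024        # assumed maximum PDF file size


def find_violations(width_pt, height_pt, page_count, size_bytes):
    """Return a human-readable message for every assumed limit that is exceeded."""
    problems = []
    max_side_in = MAX_SIDE_PT / POINTS_PER_INCH
    if width_pt > MAX_SIDE_PT:
        problems.append(
            f"page width {width_pt / POINTS_PER_INCH:.1f} in exceeds {max_side_in:.0f} in"
        )
    if height_pt > MAX_SIDE_PT:
        problems.append(
            f"page height {height_pt / POINTS_PER_INCH:.1f} in exceeds {max_side_in:.0f} in"
        )
    if page_count > MAX_PAGES:
        problems.append(f"page count {page_count} exceeds {MAX_PAGES}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file size {size_bytes} bytes exceeds {MAX_BYTES} bytes")
    return problems


if __name__ == "__main__":
    # The document from the question: roughly 105 in x 35 in, one page, about 1 MB.
    for message in find_violations(105 * 72, 35 * 72, 1, 1_000_000):
        print(message)
```

For the document described in the question, a check like this would flag only the page width, which matches the cause that was eventually found by hand.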
