2 Answers
1
Amazon Comprehend limits are documented here: https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits.html. Indeed, for entity detection, documents cannot be larger than 1 MB. I would suggest you keep your latest approach: split documents into chunks of at most 1 MB and run entity detection on each chunk. When building the Kendra index, you can then aggregate the entities detected per chunk and associate them with the original document.
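The chunk-and-aggregate approach above can be sketched in plain Python. `chunk_text` and `aggregate_entities` are hypothetical helper names, not anything from the thread; in a real pipeline each chunk would be sent to Comprehend's `detect_entities` API (e.g. via boto3), which is omitted here so the sketch stays self-contained. Note that the 1 MB limit applies to the UTF-8 byte length of the text, not the character count:

```python
def chunk_text(text, max_bytes=1_000_000):
    """Split text into chunks whose UTF-8 encoding is at most max_bytes.

    Splits on spaces so words (and most entities) are not cut in half.
    Assumes no single word exceeds max_bytes.
    """
    chunks, current, size = [], [], 0
    for word in text.split(" "):
        wlen = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if current and size + wlen > max_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += wlen
    if current:
        chunks.append(" ".join(current))
    return chunks


def aggregate_entities(per_chunk_results):
    """Merge the entity lists returned for each chunk.

    Deduplicates on (Text, Type), mirroring the keys Comprehend returns
    for each detected entity, so one logical entity found in several
    chunks is associated with the original document only once.
    """
    seen, merged = set(), []
    for entities in per_chunk_results:
        for entity in entities:
            key = (entity["Text"], entity["Type"])
            if key not in seen:
                seen.add(key)
                merged.append(entity)
    return merged
```

The per-chunk results would then be attached as metadata attributes of the original document when ingesting into the Kendra index.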
0
According to https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits.html#limits-custom-entity-recognition, the max document size for PDF and Word documents should be 50MB and 5MB, respectively, and the max document size for UTF-8 encoded plain-text documents is 1MB.
answered 2 years ago
That last bit about associating all the "bits" of the broken-up documents together when using Kendra was something I was concerned about, so I think that answers it; with Comprehend, it seems the ends justify the means, if you will. I'll continue testing and see what happens. Thanks for your answer!
*Just wanted to update: the next roadblock is that although Comprehend will accept PDF files, it won't actually produce any metadata for them because they aren't UTF-8 encoded, which I found out the hard way after looking at the "output" file. So now I have to add an extra step to all of this and convert every PDF to UTF-8 plain text.
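A minimal sketch of that extra conversion step, assuming some PDF-to-text tool (the poster's choice of extractor, e.g. pypdf or pdftotext, is not named in the thread and is not shown here) has already produced raw bytes. `to_utf8` is a hypothetical helper that guarantees the text handed to Comprehend is valid UTF-8:

```python
def to_utf8(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode extracted text bytes into a UTF-8-safe Python string.

    Tries UTF-8 first; falls back to a permissive single-byte codec
    (an assumption; pick whatever encoding your PDFs actually use),
    replacing any bytes that still cannot be decoded. The resulting
    string can be encoded back to UTF-8 before sending to Comprehend.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")
```

The decoded string can then be chunked and submitted to entity detection like any other plain-text document.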