2 Answers
1
Amazon Comprehend limits are documented here: https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits.html. Indeed, for entity detection, documents cannot be larger than 1 MB. I would suggest you keep your latest approach: split documents into chunks of at most 1 MB and run entity detection on each chunk. When building the Kendra index, you can then aggregate the entities detected per chunk and associate them with the original document.
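The chunk-and-aggregate approach above can be sketched in plain Python. `chunk_text` and `aggregate_entities` are hypothetical helper names, not anything from the thread; in a real pipeline each chunk would be sent to Comprehend's `detect_entities` API (e.g. via boto3), which is omitted here so the sketch stays self-contained. Note that the 1 MB limit applies to the UTF-8 byte length of the text, not the character count:

```python
def chunk_text(text, max_bytes=1_000_000):
    """Split text into chunks whose UTF-8 encoding is at most max_bytes.

    Splits on spaces so words (and most entities) are not cut in half.
    Assumes no single word exceeds max_bytes.
    """
    chunks, current, size = [], [], 0
    for word in text.split(" "):
        wlen = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if current and size + wlen > max_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += wlen
    if current:
        chunks.append(" ".join(current))
    return chunks


def aggregate_entities(per_chunk_results):
    """Merge the entity lists returned for each chunk.

    Deduplicates on (Text, Type), mirroring the keys Comprehend returns
    for each detected entity, so one logical entity found in several
    chunks is associated with the original document only once.
    """
    seen, merged = set(), []
    for entities in per_chunk_results:
        for entity in entities:
            key = (entity["Text"], entity["Type"])
            if key not in seen:
                seen.add(key)
                merged.append(entity)
    return merged
```

The per-chunk results would then be attached as metadata attributes of the original document when ingesting into the Kendra index.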
0
According to https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits.html#limits-custom-entity-recognition, the max document size for PDF and Word documents should be 50MB and 5MB, respectively, and the max document size for UTF-8 encoded plain-text documents is 1MB.
answered 2 years ago
That last bit about associating all the "bits" of the broken-up documents together when using Kendra was something I was concerned about, so I think that answers it; with Comprehend, it seems the ends justify the means, if you will. I'll continue testing and see what happens. Thanks for your answer!
*Just wanted to update: the next roadblock is that although Comprehend will accept PDF files, it won't actually produce any metadata for them because they aren't UTF-8 encoded, which I found out the hard way after looking at the "output" file. So now I have to add an extra step to all of this and convert every PDF to UTF-8 plain text.
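A minimal sketch of that extra conversion step, assuming some PDF-to-text tool (the poster's choice of extractor, e.g. pypdf or pdftotext, is not named in the thread and is not shown here) has already produced raw bytes. `to_utf8` is a hypothetical helper that guarantees the text handed to Comprehend is valid UTF-8:

```python
def to_utf8(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode extracted text bytes into a UTF-8-safe Python string.

    Tries UTF-8 first; falls back to a permissive single-byte codec
    (an assumption; pick whatever encoding your PDFs actually use),
    replacing any bytes that still cannot be decoded. The resulting
    string can be encoded back to UTF-8 before sending to Comprehend.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")
```

The decoded string can then be chunked and submitted to entity detection like any other plain-text document.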