The issues you're experiencing with AWS Textract's multi-page document processing are concerning but not uncommon. Let me address your specific questions:
- Progressive failure on subsequent pages: Yes, Textract can struggle with tables on the later pages of a multi-page document. Similar cases have been reported where tables after the first page are missed or only partly extracted, even though their structure and quality match the first page. The drop in block count (953 to 47) suggests that Textract is detecting the page but extracting very little of its content.
- TABLES + QUERIES feature combination: There is no explicit documentation stating that combining the TABLES and QUERIES features causes multi-page processing issues, but running both analyses together may affect the results. Try processing with the TABLES feature only and compare the per-page output to see whether multi-page detection improves.
- Processing limitations: Textract should be able to handle multi-page PDFs without issue; that is one of its core capabilities. The service is designed for asynchronous processing of large, multi-page documents. However, several factors can affect processing:
  - Table complexity: Merged cells or unusual formatting can make tables harder to detect
  - Image quality: Even in PDFs, the quality of the underlying page content matters
  - Table borders: Faint or unclear borders can affect detection
- API response truncation: The fact that your API reports "Found 2 pages" for a 3-page document suggests that results are either being cut off when they are retrieved or that Textract is not fully processing the document. This could be due to:
  - Pagination token handling: GetDocumentAnalysis returns results in batches, so you need to keep calling it with the returned NextToken until no token comes back; otherwise blocks for later pages can be missing
  - Document parsing issues: Textract might be having trouble with the third page's format
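To check whether pagination is the cause, the NextToken handling can be sketched roughly as below. This is a minimal sketch, not a definitive implementation: the helper names `get_all_blocks` and `blocks_per_page` are made up for illustration, and `client` is assumed to be a Textract client such as the one returned by `boto3.client("textract")`.

```python
from collections import Counter


def get_all_blocks(client, job_id):
    """Collect blocks from every batch returned by GetDocumentAnalysis.

    `client` is assumed to be a Textract client (for example
    boto3.client("textract")); `job_id` comes from StartDocumentAnalysis.
    """
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id, "MaxResults": 1000}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)

        status = response.get("JobStatus")
        if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            raise RuntimeError(f"Job is not finished or has failed: {status}")

        blocks.extend(response.get("Blocks", []))

        # Stop only when Textract no longer returns a token.
        next_token = response.get("NextToken")
        if not next_token:
            return blocks


def blocks_per_page(blocks):
    """Count blocks by the Page number Textract attaches to each block."""
    return Counter(block.get("Page", 1) for block in blocks)
```

If `blocks_per_page` shows all three pages once every batch has been read, the "Found 2 pages" message was most likely caused by stopping after the first response rather than by Textract itself.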
For multi-page tables specifically, Textract doesn't automatically merge tables that span across pages - this requires post-processing. However, it should still detect tables on each page independently.
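As an illustration of that post-processing, here is a rough sketch that rebuilds each TABLE block into rows and stitches tables on consecutive pages together. The continuation rule (next page, same column count) is an assumption made for this example, not something Textract provides, and the function names are hypothetical. Repeated header rows are not removed.

```python
def table_rows(table, by_id):
    """Rebuild one TABLE block as a list of rows of cell text."""
    cells = {}
    for rel in table.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = by_id.get(cell_id)
            if cell is None or cell.get("BlockType") != "CELL":
                continue
            words = []
            for cell_rel in cell.get("Relationships", []):
                if cell_rel["Type"] != "CHILD":
                    continue
                for word_id in cell_rel["Ids"]:
                    word = by_id.get(word_id, {})
                    if word.get("BlockType") == "WORD":
                        words.append(word["Text"])
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    return [
        [cells.get((row, col), "") for col in range(1, n_cols + 1)]
        for row in range(1, n_rows + 1)
    ]


def merge_tables_across_pages(blocks):
    """Append a table to the previous one when it looks like a continuation.

    Assumed heuristic: the table sits on the page right after the previous
    table and has the same number of columns.
    """
    by_id = {block["Id"]: block for block in blocks}
    tables = sorted(
        (b for b in blocks if b.get("BlockType") == "TABLE"),
        key=lambda b: b.get("Page", 1),
    )
    merged = []
    last_page = None
    for table in tables:
        rows = table_rows(table, by_id)
        if not rows:
            continue
        page = table.get("Page", 1)
        same_width = bool(merged) and len(rows[0]) == len(merged[-1][0])
        if same_width and page == last_page + 1:
            merged[-1].extend(rows)
        else:
            merged.append(rows)
        last_page = page
    return merged
```

The input is the full list of blocks for the job, gathered across all GetDocumentAnalysis responses. A stricter rule, such as comparing header text or column positions, may be needed for documents with several similar tables.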
To improve your results, you could try:
- Processing with the TABLES feature only first, to isolate whether QUERIES is causing the issue
- Ensuring your GetDocumentAnalysis calls follow NextToken until no token is returned
- Checking if there are any quality differences between pages (even subtle ones)
- Implementing custom post-processing logic to handle cases where Textract fails to extract all data
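The first suggestion, a TABLES-only run, could look something like the sketch below. The function name `run_tables_only`, the polling interval, and the poll limit are assumptions for illustration; `client` is again assumed to be a Textract client, and the document is assumed to already be in S3.

```python
import time


def run_tables_only(client, bucket, key, poll_seconds=5, max_polls=120,
                    sleep=time.sleep):
    """Start an async analysis with TABLES only and wait for it to finish.

    Returns (job_id, final_status). `bucket` and `key` identify the PDF
    in S3; both are placeholders supplied by the caller.
    """
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],  # QUERIES deliberately left out
    )
    job_id = job["JobId"]

    for _ in range(max_polls):
        response = client.get_document_analysis(JobId=job_id, MaxResults=1)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            return job_id, status
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} still in progress after {max_polls} polls")
```

Comparing the per-page block and table counts from this run with the TABLES + QUERIES run should show whether QUERIES is a factor. Polling is used here only to keep the example short; a notification channel on the job is an alternative.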
Sources
Has anyone extracted multi-page pdf tables using the Textract Javascript v3 sdk? | AWS re:Post
Processing Documents Asynchronously - Amazon Textract
AWS Textract Missing Crucial Data in a Table | AWS re:Post
