
Progressive AWS Textract Multi-Page Processing Failure in Bank Statement Document Analysis Environment


I have been trying to process multi-page bank statements asynchronously (tried with 2- and 3-page PDFs), and every time it perfectly detects and extracts the data from page 1 ONLY. It detects the second page but doesn't detect any tables on it (even though the page is identical in quality and headings), and it doesn't pick up page 3 at all. I would appreciate some help with this, please.

  • Service: AWS Textract (Async Document Analysis)
  • Features: TABLES + QUERIES
  • Implementation: Node.js with @aws-sdk/client-textract
  • Documents: Bank Statements (multi-page PDFs)

Problem Evolution:

Initial Discovery (2-page document):

  • Page 1: Perfect processing (6 tables, 24 transactions, 953 blocks)
  • Page 2: Table detection failure (0 tables, 47 blocks processed)
  • Impact: 24/44 transactions processed (about 45% data loss)

Discovery (3-page document):

  • Page 1: Perfect processing (5 tables, 771 blocks)
  • Pages 2-3: Complete processing failure (not returned in response)
  • API Response: "Found 2 pages" for 3-page document
  • Impact: 60-70% transaction data missing

Technical Evidence:

Document 1 (2-page) Diagnostic Output:

PAGE 1 TABLE DETECTION AUDIT:

  • Total blocks: 953
  • TABLE blocks found: 6
  • Main transaction table: 156 cells, 99.8% confidence, STRUCTURED_TABLE

PAGE 2 TABLE DETECTION AUDIT:

  • Total blocks: 47 ← 95% fewer blocks than Page 1
  • TABLE blocks found: 0 ← Complete detection failure
  • Block types found: PAGE, LINE only

Document 2 (3-page) Diagnostic Output:

Processing 1000 total blocks from PHASE 2 async Textract
Found 2 pages, 405 text lines ← Only 2 pages detected (document has 3)
Found 5 tables, 176 cells

PAGE 1 TABLE DETECTION AUDIT:

  • Total blocks: 771
  • TABLE blocks found: 5

[No Page 2 or Page 3 audit output - pages not processed]

Implementation Details (Proving Code is Not the Issue):

API Call:

    const command = new StartDocumentAnalysisCommand({
      DocumentLocation: { S3Object: { Bucket, Name } },
      FeatureTypes: ["TABLES", "QUERIES"],
      QueriesConfig: { Queries: bankStatementQueries }
    });

Multi-Page Processing Logic:

    // Code correctly handles unlimited pages
    const pageMap = new Map();
    allBlocks.forEach(block => {
      if (block.Page) {
        if (!pageMap.has(block.Page)) pageMap.set(block.Page, []);
        pageMap.get(block.Page).push(block);
      }
    });

    // Processes all pages AWS returns
    for (const [pageNumber, pageBlocks] of pageMap.entries()) {
      this.auditPageTableDetection(pageBlocks, pageNumber);
    }

Document Characteristics:

  • Page 1: Complex layout with bank headers, account info, transaction tables
  • Subsequent Pages: Identical table structures, same column headers, transaction continuations
  • File Sizes: Under 200KB, well within AWS limits
  • Visual Quality: High-quality PDF, clear table structures

Current System Performance:

  • Successfully processes Page 1 with 100% accuracy
  • Perfect data quality (dates, amounts, debit/credit classification)
  • Issue is purely AWS Textract's multi-page processing limitation

Questions:

  1. Why does AWS Textract progressively fail on subsequent pages?
  • 2-page doc: Page 2 gets minimal processing (47 vs 953 blocks)
  • 3-page doc: Pages 2-3 completely missing from response

  2. Is this a known limitation with the TABLES + QUERIES feature combination?
  • Does adding QUERIES affect multi-page processing capability?
  • Should TABLES-only be used for multi-page documents?

  3. Are there undocumented limits for async document analysis?
  • Document complexity thresholds
  • Page count limitations
  • Processing timeout issues

  4. Why does the API response truncate pages?
  • GetDocumentAnalysisCommand returns "2 pages" for 3-page document
  • Is this a pagination issue in the response parsing?

asked 2 months ago
45 views
1 Answer

The issues you're experiencing with AWS Textract's multi-page document processing are concerning but not uncommon. Let me address your specific questions:

  1. Progressive failure on subsequent pages: Yes, Textract can sometimes struggle with processing tables across multiple pages. This is a known limitation where the service might fail to recognize or extract tables from pages after the first one, even when they have similar structure and quality. The dramatic drop in block count (953 to 47) indicates that Textract is detecting the page but failing to properly analyze its content.

  2. TABLES + QUERIES feature combination: While there's no explicit documentation stating that combining TABLES and QUERIES features causes multi-page processing issues, it's possible that the complexity of running both analyses simultaneously affects performance. You could try processing with TABLES feature only to see if that improves multi-page detection.

  3. Processing limitations: Textract should be able to handle multi-page PDFs without issue - that's one of its core capabilities. The service is designed for asynchronous processing of large, multi-page documents. However, there are some factors that might affect processing:

  • Table complexity: If your tables have merged cells or unusual formatting
  • Image quality: Even in PDFs, the underlying quality matters
  • Table borders: Faint or unclear borders can affect detection
  4. API response truncation: The fact that your output reports "Found 2 pages" for a 3-page document suggests that either the results are being read incompletely or Textract is having trouble with the document itself. This could be due to:

  • Pagination token handling: You might need to ensure you're properly handling the NextToken parameter to get all pages of results
  • Document parsing issues: Textract might be having trouble with the third page's format
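
The pagination point is worth checking first. The diagnostic output in the question reports exactly 1000 total blocks, and 1,000 is the default (and maximum) number of blocks that a single GetDocumentAnalysis call returns, so it is possible that only the first page of results was read. A minimal sketch of following NextToken until it is absent is shown below. The `fetchPage` parameter is a hypothetical adapter introduced here for illustration, not part of the SDK; it is assumed to resolve to one GetDocumentAnalysis response per call.

```javascript
// Collects every block of an async analysis job by following NextToken.
// `fetchPage` is a hypothetical adapter (not an SDK function): given a token
// (undefined on the first call) it must resolve to one GetDocumentAnalysis
// response object.
async function getAllBlocks(fetchPage) {
  const blocks = [];
  let pages; // DocumentMetadata.Pages as reported by Textract
  let nextToken;

  do {
    const response = await fetchPage(nextToken);

    // Stop early if the job has not produced usable results.
    if (response.JobStatus === "IN_PROGRESS" || response.JobStatus === "FAILED") {
      throw new Error(`Job has no complete results: ${response.JobStatus}`);
    }

    blocks.push(...(response.Blocks ?? []));
    pages = response.DocumentMetadata?.Pages ?? pages;
    nextToken = response.NextToken; // absent on the last page of results
  } while (nextToken);

  return { blocks, pages };
}
```

With the v3 SDK, the adapter could be written as `(token) => client.send(new GetDocumentAnalysisCommand({ JobId: jobId, NextToken: token }))`, where `client` and `jobId` are your own TextractClient instance and the JobId returned by StartDocumentAnalysis. If the total block count rises well above 1000 after this change, the missing pages were most likely a result-pagination problem rather than a detection failure.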

For multi-page tables specifically, Textract doesn't automatically merge tables that span across pages - this requires post-processing. However, it should still detect tables on each page independently.
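
As a rough illustration of that post-processing, the sketch below rebuilds each TABLE block into rows of text and then joins tables that continue onto the next page. The continuation rule is an assumption made for this example (next page, same column count, repeated header row dropped), and the function names are hypothetical; real statements may need a stricter rule.

```javascript
// Returns the Ids listed in a block's CHILD relationships (empty if none).
function childIds(block) {
  return (block.Relationships ?? [])
    .filter((rel) => rel.Type === "CHILD")
    .flatMap((rel) => rel.Ids ?? []);
}

// Converts one TABLE block into { page, rows }, where rows is a 2-D array of cell text.
function tableToRows(table, byId) {
  const rows = [];
  for (const id of childIds(table)) {
    const cell = byId.get(id);
    if (!cell || cell.BlockType !== "CELL") continue;
    const text = childIds(cell)
      .map((wordId) => byId.get(wordId))
      .filter((b) => b && b.BlockType === "WORD")
      .map((b) => b.Text)
      .join(" ");
    const r = cell.RowIndex - 1; // RowIndex and ColumnIndex are 1-based
    rows[r] = rows[r] ?? [];
    rows[r][cell.ColumnIndex - 1] = text;
  }
  return { page: table.Page, rows };
}

// Heuristic (an assumption, not Textract behaviour): a table continues the
// previously collected one when it is on the following page and has the same
// number of columns. A header row identical to the first table's is dropped.
function mergeTablesAcrossPages(blocks) {
  const byId = new Map(blocks.map((b) => [b.Id, b]));
  const width = (t) => t.rows[0]?.length ?? 0;
  const tables = blocks
    .filter((b) => b.BlockType === "TABLE")
    .map((t) => tableToRows(t, byId))
    .sort((a, b) => a.page - b.page);

  const merged = [];
  for (const table of tables) {
    const prev = merged[merged.length - 1];
    const continues =
      prev &&
      table.page === prev.lastPage + 1 &&
      width(table) > 0 &&
      width(table) === width(prev);
    if (continues) {
      const sameHeader = table.rows[0].join("|") === prev.rows[0].join("|");
      prev.rows.push(...(sameHeader ? table.rows.slice(1) : table.rows));
      prev.lastPage = table.page;
    } else {
      merged.push({ firstPage: table.page, lastPage: table.page, rows: [...table.rows] });
    }
  }
  return merged;
}
```

The input is the flat array of blocks collected from all GetDocumentAnalysis responses. Because the rule only compares a table with the most recently collected one, a page that holds several tables may need an extra check (for example on header text) before merging.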

To improve your results, you could try:

  1. Processing with TABLES feature only first to isolate if QUERIES is causing issues
  2. Ensuring your pagination token handling is correct in the GetDocumentAnalysis calls
  3. Checking if there are any quality differences between pages (even subtle ones)
  4. Implementing custom post-processing logic to handle cases where Textract fails to extract all data
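
To make steps 2 and 3 easier to verify, a small per-page summary can show which pages came back with no blocks at all and which came back without TABLE blocks. This is a sketch under stated assumptions: `summarizePages` is a hypothetical helper, and `expectedPages` is assumed to be either the DocumentMetadata.Pages value from the response or the page count you know the PDF has.

```javascript
// Counts block types per page and flags pages that are missing entirely or
// that contain no TABLE blocks. Blocks without a Page value are skipped,
// matching the grouping logic in the question.
function summarizePages(blocks, expectedPages) {
  const perPage = new Map();
  for (const block of blocks) {
    if (!block.Page) continue;
    const counts = perPage.get(block.Page) ?? {};
    counts[block.BlockType] = (counts[block.BlockType] ?? 0) + 1;
    perPage.set(block.Page, counts);
  }

  const missingPages = [];
  const pagesWithoutTables = [];
  for (let page = 1; page <= expectedPages; page++) {
    const counts = perPage.get(page);
    if (!counts) missingPages.push(page);
    else if (!counts.TABLE) pagesWithoutTables.push(page);
  }
  return { perPage, missingPages, pagesWithoutTables };
}
```

Running the same summary once for a TABLES-only job and once for a TABLES + QUERIES job on the same PDF gives a direct comparison for step 1. A page listed in `missingPages` points toward incomplete result retrieval or a parsing problem, while a page in `pagesWithoutTables` points toward table detection on that page.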
answered 2 months ago
EXPERT
reviewed 2 months ago
