Has anyone extracted multi-page PDF tables using the Textract JavaScript v3 SDK?

I've been trying to extract tables asynchronously from a multi-page PDF file, but it appears that only the table on the first page is returned. I successfully receive a JobId and pass the NextToken into the subsequent call to GetDocumentAnalysisCommand. The Textract demo in the console appears to extract tables from all the pages. I'm not sure whether this is possible with the JavaScript v3 SDK on Node.js or whether there's an issue with my code.

Here's the relevant part of the code. Please advise on whether it is in order.

let waitTime = 0;
const getJob = async () => {
  const { Messages } = await sqsClient
    .send(
      new ReceiveMessageCommand({
        QueueUrl: SNSFunc.sqsQueueUrl,
        MaxNumberOfMessages: 1,
      })
    )
    .catch((err) => console.log(err));
  if (Messages) {
    console.log(`Message[0]: ${Messages[0].Body}`);

    if (
      JSON.parse(JSON.parse(Messages[0].Body).Message).Status ===
      JobStatus.SUCCEEDED
    ) {
      var maxResults = 1000;
      var paginationToken = null;
      var finished = false;

      while (finished == false) {
        var response = null;

        if (paginationToken == null) {
          response = AWS.send(
            new GetDocumentAnalysisCommand({
              JobId: JobIDFunc,
              MaxResults: maxResults,
            })
          ).catch((err) => console.log(err));
        } else {
          response = AWS.send(
            new GetDocumentAnalysisCommand({
              JobId: JobIDFunc,
              MaxResults: maxResults,
              NextToken: paginationToken,
            })
          ).catch((err) => console.log(err));
        }

        let nextToken = await response.NextToken;

        if (nextToken) {
          paginationToken = nextToken;
        } else {
          finished = true;
        }
      }

      async function main() {
        const blocksVal = await response;

        const tableCsv = await getTableCsvResults(blocksVal);

        return tableCsv;
      }

      let promise10 = await main();

      return promise10;

    } else {
      const tick = 5000;
      waitTime += tick;
      console.log(
        `Waited ${waitTime / 1000} seconds. No messages yet.`
      );
      setTimeout(getJob, tick);
      return;
    }
  }
  return await getJob();
};
asked 4 months ago, 151 views
2 Answers

The blog post Didier linked is great, and demonstrates in particular that merging tables split across pages is something you need to do in post-processing - Textract won't do it for you.

However it sounds like your problem might be more straightforward - just not able to iterate through all tables detected in multi-page documents? I can definitely confirm Textract returns these. From your code snippet I'd guess this is a bug somewhere, most likely in the getTableCsvResults function that isn't shared?
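
One quick way to check where the tables go missing is to count TABLE blocks per page in whatever you hand to your CSV function. The following is only a minimal sketch: countTablesPerPage is a hypothetical helper name, and it assumes you pass it an array of raw Textract blocks (using the BlockType and Page fields of each block).

```javascript
// Count TABLE blocks per page in an array of raw Textract blocks.
// `blocks` is assumed to be the Blocks array you pass on for CSV conversion.
function countTablesPerPage(blocks) {
  const counts = new Map();
  for (const block of blocks) {
    if (block.BlockType !== "TABLE") continue;
    // Page numbers start at 1; fall back to 1 if the field is missing.
    const page = block.Page ?? 1;
    counts.set(page, (counts.get(page) ?? 0) + 1);
  }
  return counts;
}
```

If the resulting map only ever has an entry for page 1, the later pages were probably lost before parsing rather than inside the parser.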

You might be interested in our Amazon Textract Response Parser (TRP) library for JS, which simplifies the code for parsing and navigating the returned blocks. It has built-in iterators like:

doc.listPages().forEach((page) => {
  const tables = page.listTables();
});

While the TRP library for Python already implements inter-page table merging, we don't have it yet in TRP.js... But if you're interested in that, please do raise it as a feature request on GitHub!

AWS
EXPERT
answered 4 months ago
  • Thanks for your answers. I found the solution. I used the push method to append paginated responses to an array and it worked.
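
For anyone landing here later, that approach might look roughly like the sketch below. It is not the asker's actual code: collectAllBlocks and fetchPage are hypothetical names, and fetchPage is assumed to be a callback that takes a NextToken (undefined on the first call) and resolves to one GetDocumentAnalysis response. The two points it illustrates are awaiting each response before reading NextToken, and pushing every page's Blocks into one array instead of keeping only the last response.

```javascript
// Collect the Blocks from every page of a paginated Textract result.
// `fetchPage` is a hypothetical callback: given a NextToken (or undefined for
// the first call) it resolves to one GetDocumentAnalysis response.
async function collectAllBlocks(fetchPage) {
  const blocks = [];
  let nextToken;
  do {
    // Await the response first; reading NextToken off an un-awaited Promise
    // always yields undefined, which ends the loop after the first page.
    const response = await fetchPage(nextToken);
    blocks.push(...(response.Blocks ?? []));
    nextToken = response.NextToken;
  } while (nextToken);
  return blocks;
}

// Possible wiring with the v3 SDK (client and jobId come from your own code):
// const blocks = await collectAllBlocks((NextToken) =>
//   client.send(
//     new GetDocumentAnalysisCommand({ JobId: jobId, MaxResults: 1000, NextToken })
//   )
// );
```

The combined array can then be wrapped as { Blocks: blocks } if your table-to-CSV function expects a response-shaped object.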

Hi,

I think you'll want to read this blog post detailing how to handle multi-page tables with Amazon Textract: https://aws.amazon.com/blogs/machine-learning/postprocessing-with-amazon-textract-multi-page-table-handling/

Best

Didier

AWS
EXPERT
answered 4 months ago
