Textract returns JSON with only the first 2 pages


Hello. We are developing an app that uses Textract to perform OCR on documents. When we upload a PDF to a bucket and process it via the API, the returned JSON contains only the first 2 pages of a document that has more than 30. Is this happening because I am still within the 3-month trial period? If so, I would like to pay for the service to remove that restriction, but I haven't found where to make the change. Or maybe the problem is something else. I upload the PDF to an S3 bucket first and then process it from there with start_document_text_detection followed by get_document_text_detection. Thanks

asked 7 months ago · 303 views
3 Answers
Accepted Answer

Hi Carlos,

When you call "get_document_text_detection", the results come back paginated, and the response includes a "NextToken" whenever more results remain.
Pass that "NextToken" to the next call and repeat until the response no longer contains one; that retrieves the rest of the results. [1]

See [2] and [3] for examples of how to use the "NextToken".
Example code snippet:

import boto3

def getJobResults(jobId):
    """Fetch every page of results for a Textract text-detection job."""
    pages = []
    client = boto3.client('textract')

    # The first call takes no NextToken.
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = response.get('NextToken')

    # Keep requesting until the response no longer includes a NextToken.
    while nextToken:
        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = response.get('NextToken')
    return pages
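
Note that each entry in the list returned by getJobResults is one API response (a batch of blocks), not one page of the PDF. To check that all document pages really came back, a small helper like the one below may be useful. It is only a sketch: the name lines_by_page is made up for illustration, and it assumes each block carries the "BlockType", "Text", and "Page" fields that asynchronous text detection returns.

```python
from collections import defaultdict


def lines_by_page(result_pages):
    """Group detected LINE text by document page number.

    `result_pages` is the list of response dicts collected by
    following NextToken (for example, the output of getJobResults).
    """
    grouped = defaultdict(list)
    for response in result_pages:
        for block in response.get("Blocks", []):
            if block.get("BlockType") == "LINE":
                # Assumption: async results tag each block with "Page";
                # fall back to page 1 if the field is missing.
                grouped[block.get("Page", 1)].append(block.get("Text", ""))
    # Return a plain dict ordered by page number.
    return dict(sorted(grouped.items()))
```

Calling len(lines_by_page(getJobResults(jobId))) should then give the number of document pages that contain text, which for a 30-page PDF would be well above 2 once every NextToken has been followed.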

References:
[1] https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentTextDetection.html
[2] https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing/blob/master/src/jobresultsproc.py
[3] https://medium.com/petabytz/automatically-extract-data-using-aws-textract-7a599b80b92

answered 7 months ago

Thanks!!! Works perfectly!!!

answered 7 months ago

And, for reference: if you set OutputConfig in asynchronous Textract API calls (which you probably should, because it saves on Get* calls, which are TPS-limited), you can read the results with the function get_full_json_from_output_config(output_config: OutputConfig, job_id: str, s3_client=None) -> dict from the amazon-textract-caller PyPI package.
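
If you would rather not add the package, the same idea can be sketched with plain S3 calls. This is not the package's implementation. The function name load_textract_output is hypothetical, and the code assumes Textract writes numbered result files ("1", "2", ...) under "<S3Prefix>/<JobId>/" in the output bucket, so please verify that layout in your own bucket first.

```python
import json


def load_textract_output(s3_client, bucket, prefix, job_id):
    """Merge the result files Textract wrote to S3 into one dict.

    Assumption: results live at <prefix>/<job_id>/<n> with n = 1, 2, ...
    """
    job_prefix = f"{prefix.rstrip('/')}/{job_id}/"

    # List every object under the job prefix, following S3 pagination.
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": job_prefix}
    while True:
        listing = s3_client.list_objects_v2(**kwargs)
        keys += [obj["Key"] for obj in listing.get("Contents", [])]
        if not listing.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = listing["NextContinuationToken"]

    # Keep only numbered files (skips helper objects such as
    # ".s3_access_check") and sort numerically so "10" follows "2".
    numbered = sorted(
        (k for k in keys if k.rsplit("/", 1)[-1].isdigit()),
        key=lambda k: int(k.rsplit("/", 1)[-1]),
    )

    merged = {"Blocks": []}
    for key in numbered:
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        part = json.loads(body)
        if "DocumentMetadata" in part and "DocumentMetadata" not in merged:
            merged["DocumentMetadata"] = part["DocumentMetadata"]
        merged["Blocks"].extend(part.get("Blocks", []))
    return merged


# Usage sketch (bucket and file names are placeholders):
#   import boto3
#   textract = boto3.client("textract")
#   job = textract.start_document_text_detection(
#       DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "doc.pdf"}},
#       OutputConfig={"S3Bucket": "my-bucket", "S3Prefix": "textract_output"},
#   )
#   ...wait until the job has finished, then:
#   result = load_textract_output(
#       boto3.client("s3"), "my-bucket", "textract_output", job["JobId"])
```

Passing the S3 client in as a parameter keeps the function easy to exercise with a stub, and reading from S3 avoids the repeated get_document_text_detection calls altogether.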

AWS
answered 7 months ago
