Textract returns JSON with only the first 2 pages

Question

Hello. We are developing an app that uses Textract to perform OCR on documents. When we upload PDF documents to a bucket and process them via the API, it returns a JSON file with only the first 2 pages of a document that has more than 30. Is this happening because I am still within the 3-month trial period? If so, I want to pay for the service to unlock that restriction, but I haven't found where to make the change. Or maybe the problem is something else...
I am using an S3 bucket to upload the PDF first and then process it from there with start_document_text_detection and then get_document_text_detection...
Thanks

Accepted Answer

Hi Carlos,

"get_document_text_detection" returns paginated results. When more results remain, the response includes a "NextToken", so what you are seeing is most likely just the first batch of results. \
Pass that "NextToken" to the next call, and repeat until the response no longer contains one, to fetch the rest of the results. [1]

Please see [2] and [3] for examples of how to use the "NextToken". \
Example Code Snippet: 
```
import boto3


def getJobResults(jobId):
    """Collect every paginated response for a finished Textract job."""
    pages = []
    client = boto3.client('textract')

    # The first call carries no NextToken
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = response.get('NextToken')

    # Keep requesting until the service stops returning a NextToken
    while nextToken:
        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = response.get('NextToken')
    return pages
```

References: \
[1] https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentTextDetection.html \
[2] https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing/blob/master/src/jobresultsproc.py \
[3] https://medium.com/petabytz/automatically-extract-data-using-aws-textract-7a599b80b92

Answer

For reference, if you use the OutputConfig in asynchronous Textract API calls (which you probably should, because it saves on Get* calls, which are TPS-limited), you can load the complete result with the function ```def get_full_json_from_output_config(output_config: OutputConfig, job_id: str, s3_client=None) -> dict:``` [(source)](https://github.com/aws-samples/amazon-textract-textractor/blob/6e7125c51a351900089102bee1ef2c679c635df2/caller/textractcaller/t_call.py#L262) from the [amazon-textract-caller](https://pypi.org/project/amazon-textract-caller/) PyPI package.
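If you would rather not add the package, here is a rough sketch of the same idea using plain boto3. It is not the library's implementation. It assumes Textract writes the result parts under `<S3Prefix>/<JobId>/` as objects named `1`, `2`, `3`, ... (plus a marker file that is skipped), and the helper names `start_job_with_output` and `load_textract_output`, along with the bucket and prefix values, are made up for illustration.

```python
import json


def start_job_with_output(textract_client, bucket, document_key, output_prefix):
    """Start an async text detection job that writes its results to S3."""
    response = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": document_key}},
        OutputConfig={"S3Bucket": bucket, "S3Prefix": output_prefix},
    )
    return response["JobId"]


def load_textract_output(bucket, prefix, job_id, s3_client=None):
    """Merge the per-part JSON files of a finished job into one dict."""
    if s3_client is None:
        import boto3  # imported lazily so a stub client can be passed in tests
        s3_client = boto3.client("s3")

    # Assumption: parts live under <prefix>/<job_id>/ and are named 1, 2, 3, ...
    key_prefix = f"{prefix.rstrip('/')}/{job_id}/"
    parts = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=key_prefix):
        for obj in page.get("Contents", []):
            name = obj["Key"][len(key_prefix):]
            if name.isdigit():  # skips marker files such as .s3_access_check
                parts.append((int(name), obj["Key"]))

    merged = {"Blocks": []}
    for _, key in sorted(parts):  # numeric order, so part 10 comes after part 2
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        part = json.loads(body)
        merged.setdefault("DocumentMetadata", part.get("DocumentMetadata"))
        merged["Blocks"].extend(part.get("Blocks", []))
    return merged
```

Call `load_textract_output` only after the job has reported `SUCCEEDED`; the merged dict then contains the blocks for all pages, not just the first batch.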

Answer

Thanks!!! Works perfectly!!!
