UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format


Hi, I have a multi-page PDF document that the Amazon Textract web interface processes without problems, including key-value pair extraction. However, when I try to extract key-value pairs in my Python code, it returns the error below:

UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Below is my code:

    try:
        response = textract.analyze_document(
            Document={
                "S3Object": {
                    "Bucket": bucketname,
                    "Name": filename,
                }
            },
            FeatureTypes=["FORMS"],  # assumed: FORMS, since the goal is key-value pairs
            HumanLoopConfig={
                "HumanLoopName": uuid.uuid4().hex,
                "FlowDefinitionArn": FLOW_ARN,
                "DataAttributes": {
                    "ContentClassifiers": [],  # classifier values not shown
                },
            },
        )
        return {
            "statusCode": 200,
            "body": json.dumps("Document processed successfully!"),
        }
    except Exception:
        return {"statusCode": 500, "body": json.dumps("Issue processing file!")}

I thought the failure might be because my PDF document is multi-page, so I tried to read the PDF page by page and modified my code as follows:

    try:
        # Start document text detection
        response = textract.start_document_text_detection(
            DocumentLocation={
                "S3Object": {
                    "Bucket": bucketname,
                    "Name": filename,
                }
            },
            ClientRequestToken=str(uuid.uuid4()),  # Generate a unique client request token
        )
        # Retrieve the job ID from the response
        job_id = response["JobId"]
        # Poll for the completion of the job
        while True:
            job_status = textract.get_document_text_detection(JobId=job_id)['JobStatus']
            if job_status in ['SUCCEEDED', 'FAILED']:
                break
            time.sleep(5)  # Wait for 5 seconds before checking again
        # Get the results of the detection
        response = textract.get_document_text_detection(JobId=job_id)
        # Process each page of the document
        for page_result in response['Blocks']:
            if page_result['BlockType'] == 'PAGE':
                page_number = page_result['Page']
                # Note: this call still receives the whole multi-page file
                response = textract.analyze_document(
                    Document={
                        "S3Object": {
                            "Bucket": bucketname,
                            "Name": filename,
                        }
                    },
                    FeatureTypes=["FORMS"],  # assumed: FORMS, since the goal is key-value pairs
                    HumanLoopConfig={
                        "HumanLoopName": uuid.uuid4().hex,
                        "FlowDefinitionArn": FLOW_ARN,
                        "DataAttributes": {
                            "ContentClassifiers": [],  # classifier values not shown
                        },
                    },
                )
        return {
            "statusCode": 200,
            "body": json.dumps("Document processed successfully!"),
        }
    except Exception:
        return {"statusCode": 500, "body": json.dumps("Issue processing file!")}

However, I am still getting the same UnsupportedDocumentException error.
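
One direction that may be worth trying, offered as a hedged sketch and not a confirmed fix: as far as I understand, the synchronous AnalyzeDocument operation accepts only single-page documents, and multi-page PDFs go through the asynchronous StartDocumentAnalysis / GetDocumentAnalysis pair instead. The helper name `analyze_multipage_forms` and its parameters (`poll_seconds`, `max_polls`, `sleep`) are hypothetical, and `FeatureTypes=["FORMS"]` is an assumption based on the goal of extracting key-value pairs. The Textract client is passed in, so the function itself needs nothing beyond the standard library.

```python
import time


def analyze_multipage_forms(textract, bucket, key,
                            poll_seconds=5, max_polls=120, sleep=time.sleep):
    """Run asynchronous FORMS analysis on an S3 document and return all blocks.

    `textract` is expected to behave like boto3.client("textract").
    """
    # Start the asynchronous job. Unlike analyze_document, the file is
    # passed as DocumentLocation, not Document.
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],  # assumption: key-value pairs are wanted
    )
    job_id = job["JobId"]

    # Poll until the job leaves IN_PROGRESS, with an upper bound on attempts.
    for _ in range(max_polls):
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if result["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(
            f"Textract job {job_id} ended with status {result['JobStatus']}"
        )

    # Results are paginated: follow NextToken until every block is collected.
    blocks = list(result.get("Blocks", []))
    while "NextToken" in result:
        result = textract.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks
```

With the variables from the code above, this would be called as `analyze_multipage_forms(boto3.client("textract"), bucketname, filename)`. Note that `HumanLoopConfig` is a parameter of the synchronous `analyze_document` call only, so the human-review step would probably need to be handled separately if this route is taken.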

Any help or pointers would be appreciated.


asked 2 months ago, 141 views
No Answers
