Utilizing new Amazon Textract Layout Features in PHP

Question

This question belongs in the Textract category, but the system places it under ML for some unknown reason.

I am interfacing with the Textract API using the AWS PHP SDK.  I'd like to use the new Textract Layout feature announced here:  https://aws.amazon.com/about-aws/whats-new/2023/09/amazon-textract-layout-feature-extract-paragraphs-titles-documents/ and further discussed in this article: https://aws.amazon.com/blogs/machine-learning/amazon-textracts-new-layout-feature-introduces-efficiencies-in-general-purpose-and-generative-ai-document-processing-tasks/

Specifically, I'd like to use the **Layout with AnalyzeDocument Layout Feature**, as shown in the article example. Unfortunately for me, the code example is in Python:

> The following code snippet generates the layout-linearized text from the document. You can use either method to generate the linearized text from the document using the latest version of Amazon Textract Textractor Python library.

```
import textractcaller as tc
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

layout_textract_json = call_textract(input_document = input_document,
                                     features = [Textract_Features.LAYOUT])
layout_text = get_text_from_layout_json(textract_json = layout_textract_json)[1]
print(layout_text)
```

I want to accomplish the same as above, but in PHP.  Additionally I want to:

1. Read a PDF from the local file system and generate layout-linearized text, as in the example above.

2. Generate text extract with the Layout feature discussed above, but excluding:
* Page numbers
* Page headers
* Page footers

Can someone show me a code example of this in PHP?

Thanks!

UPDATE:

Using ChatGPT, I was at least able to cobble together PHP code that extracts text from a PDF:

```
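<?php
// Composer autoloader (path assumed; adjust to your project layout)
require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Aws\Textract\TextractClient;
use Aws\Exception\AwsException;

// Create an S3 client
$s3Client = new S3Client([
    'region' => 'us-west-2',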
    'version' => 'latest',
    'credentials' => [
        'key' => getenv('AWS_ACCESS_KEY_ID'),
        'secret' => getenv('AWS_SECRET_ACCESS_KEY')
    ]
]);

// Create a Textract client
$textractClient = new TextractClient([
    'region' => 'us-west-2',
    'version' => '2018-06-27',
    'credentials' => [
        'key' => getenv('AWS_ACCESS_KEY_ID'),
        'secret' => getenv('AWS_SECRET_ACCESS_KEY')
    ]
]);

// Specify the path to the local file
$localFilePath = './13_article.pdf';

// Specify the S3 bucket and object key where you want to upload the file
$bucketName = 'docs.scbbs.com';
$objectKey = 'docs/test/13_article.pdf';

// Upload the local file to S3
$result = $s3Client->putObject([
    'Bucket' => $bucketName,
    'Key' => $objectKey,
    'SourceFile' => $localFilePath
]);

if ($result['@metadata']['statusCode'] == 200) {
    echo "File uploaded to S3\n";
} else {
    echo "Failed to upload file to S3\n";
    exit;
}

// Prepare the request for StartDocumentAnalysis
$options = [
    'DocumentLocation' => [
        'S3Object' => [
            'Bucket' => $bucketName,
            'Name' => $objectKey
        ]
    ],
    'FeatureTypes' => ['TABLES', 'FORMS', 'LAYOUT']  // Include 'LAYOUT' in the feature types
];

try {
    // Call the StartDocumentAnalysis API
    $result = $textractClient->startDocumentAnalysis($options);
    echo "Document analysis started\n";
    $jobId = $result['JobId'];
    echo "Job Id: $jobId\n";

    // Poll for the job status
    do {
        sleep(30); // wait for 30 seconds before checking the status again
        $result = $textractClient->getDocumentAnalysis(['JobId' => $jobId]);
        $jobStatus = $result['JobStatus'];
        echo "Job status: $jobStatus\n";
    } while ($jobStatus == 'IN_PROGRESS');

    if ($jobStatus == 'SUCCEEDED') {
        // Fetch all pages of the result
        $nextToken = null;
        do {
            $params = [
                'JobId' => $jobId
            ];
            if ($nextToken) {
                $params['NextToken'] = $nextToken;
            }
            $response = $textractClient->getDocumentAnalysis($params);
            foreach ($response['Blocks'] as $block) {
                if ($block['BlockType'] === 'LINE') {
                    echo $block['Text'] . "\n";
                }
            }
            $nextToken = $response['NextToken'] ?? null;
        } while ($nextToken);
    }
} catch (AwsException $e) {
    // Handle AWS exceptions
    echo "AWS Error: " . $e->getAwsErrorMessage() . "\n";
} catch (\Exception $e) {
    // Handle general exceptions
    echo "General Error: " . $e->getMessage() . "\n";
}

?>
```
This is the source PDF: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_Article_11_1.pdf
This is the text output by the code above: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_article-output+using+04.txt

The code requests the `LAYOUT` feature type but only prints `LINE` blocks, so it does not actually use the Layout results, and the output text still contains page numbers, page headers, and page footers.

Can someone explain how I can implement the Layout feature with this code to get the desired output?

Answer

Thank you for using Amazon Textract.
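
For the first part of the question (reading a PDF from the local file system), one option may be the synchronous AnalyzeDocument operation, which accepts the document as bytes instead of an S3 location. As far as I know, the synchronous call only handles single-page documents, so a multi-page PDF like the one linked in the question would still need the StartDocumentAnalysis flow through S3. The following is a minimal sketch under those assumptions; `buildLayoutRequest` is a hypothetical helper name, not part of the SDK.

```php
<?php
/**
 * Build the parameter array for a synchronous AnalyzeDocument call
 * from a file on the local file system.
 * (Hypothetical helper; the name is not part of the AWS SDK.)
 */
function buildLayoutRequest(string $path): array
{
    $bytes = @file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read file: $path");
    }

    return [
        // Raw file bytes; the SDK is expected to handle the encoding itself.
        'Document' => ['Bytes' => $bytes],
        // Only the Layout feature is requested here.
        'FeatureTypes' => ['LAYOUT'],
    ];
}

// Assumed usage with the $textractClient from the question:
// $result = $textractClient->analyzeDocument(buildLayoutRequest('./page1.pdf'));
// $blocks = $result['Blocks'];
```

The file name `./page1.pdf` is only a placeholder for a single-page document.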

You can refer to this document to understand which blocks the Layout feature can return in the response, and then parse the JSON response accordingly: https://docs.aws.amazon.com/textract/latest/dg/layoutresponse.html
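
As a rough illustration of that parsing step, the sketch below walks the `LAYOUT_*` blocks, follows their `CHILD` relationships down to `LINE` blocks, and skips `LAYOUT_HEADER`, `LAYOUT_FOOTER`, and `LAYOUT_PAGE_NUMBER`. It assumes the layout blocks arrive in reading order and that `LAYOUT_LIST` blocks may contain nested layout blocks; the function names (`layoutText`, `collectLines`, `childIds`, `isLayoutBlock`) are hypothetical helpers, not part of the SDK, and this is not a port of the Python `get_text_from_layout_json` function.

```php
<?php
/** True when the block is one of the LAYOUT_* block types. */
function isLayoutBlock(array $block): bool
{
    return strncmp($block['BlockType'], 'LAYOUT_', 7) === 0;
}

/** Ids referenced by the block's CHILD relationships (empty if none). */
function childIds(array $block): array
{
    $ids = [];
    foreach ($block['Relationships'] ?? [] as $rel) {
        if ($rel['Type'] === 'CHILD') {
            $ids = array_merge($ids, $rel['Ids']);
        }
    }
    return $ids;
}

/** Text of all LINE blocks under a layout block, descending into nested layout blocks. */
function collectLines(array $block, array $byId): array
{
    $lines = [];
    foreach (childIds($block) as $id) {
        $child = $byId[$id] ?? null;
        if ($child === null) {
            continue;
        }
        if ($child['BlockType'] === 'LINE') {
            $lines[] = $child['Text'];
        } elseif (isLayoutBlock($child)) {
            $lines = array_merge($lines, collectLines($child, $byId));
        }
    }
    return $lines;
}

/**
 * Join the text of layout blocks, leaving out the block types listed in $skip.
 * $blocks should hold the blocks from every page of the paginated response.
 */
function layoutText(
    array $blocks,
    array $skip = ['LAYOUT_HEADER', 'LAYOUT_FOOTER', 'LAYOUT_PAGE_NUMBER']
): string {
    // Index blocks by Id so CHILD relationships can be resolved.
    $byId = [];
    foreach ($blocks as $block) {
        $byId[$block['Id']] = $block;
    }

    // Layout blocks nested inside another layout block are emitted via their parent.
    $nested = [];
    foreach ($blocks as $block) {
        if (!isLayoutBlock($block)) {
            continue;
        }
        foreach (childIds($block) as $id) {
            if (isset($byId[$id]) && isLayoutBlock($byId[$id])) {
                $nested[$id] = true;
            }
        }
    }

    $paragraphs = [];
    foreach ($blocks as $block) {
        if (!isLayoutBlock($block)
            || in_array($block['BlockType'], $skip, true)
            || isset($nested[$block['Id']])) {
            continue;
        }
        $lines = collectLines($block, $byId);
        if ($lines) {
            $paragraphs[] = implode("\n", $lines);
        }
    }
    return implode("\n\n", $paragraphs);
}
```

With the code from the question, one way to use this would be to collect `$response['Blocks']` from every `NextToken` page into a single array instead of echoing each `LINE` block, and then call `echo layoutText($allBlocks);` once pagination has finished.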
