This question belongs in the Textract category, but the system places it under ML for some unknown reason.
I am interfacing with the Textract API using the AWS PHP SDK. I'd like to use the new Textract Layout feature announced here: https://aws.amazon.com/about-aws/whats-new/2023/09/amazon-textract-layout-feature-extract-paragraphs-titles-documents/ and further discussed in this article: https://aws.amazon.com/blogs/machine-learning/amazon-textracts-new-layout-feature-introduces-efficiencies-in-general-purpose-and-generative-ai-document-processing-tasks/
Specifically, I'd like to use the **Layout with AnalyzeDocument Layout Feature** as shown in the article example. Unfortunately for me, the code example is in Python:
The following code snippet generates the layout-linearized text from the document. You can use either method to generate the linearized text from the document using the latest version of Amazon Textract Textractor Python library.
import textractcaller as tc
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json
layout_textract_json = call_textract(input_document = input_document,
                                     features = [Textract_Features.LAYOUT])
layout_text = get_text_from_layout_json(textract_json = layout_textract_json)[1]
print(layout_text)
I want to accomplish the same as above, but in PHP. Additionally, I want to:
- Read a PDF locally (from the local file system) and generate layout-linearized text like you see in the example.
- Generate a text extract with the Layout feature discussed above, but excluding:
  - Page numbers
  - Page headers
  - Page footers
Can someone show me a code example of this in PHP?
Thanks!
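For the local-file part, a minimal PHP sketch, assuming the synchronous AnalyzeDocument operation is acceptable: the document bytes go in `Document.Bytes` instead of an S3 reference. `buildLayoutRequest()` is a hypothetical helper name, not part of the SDK. As far as I know the synchronous operation only accepts single-page PDFs, so a multi-page file would still need the S3-based asynchronous route.

```php
<?php
// Sketch only: builds the parameter array for a synchronous AnalyzeDocument
// call from a local file. buildLayoutRequest() is a hypothetical helper name.
function buildLayoutRequest(string $localFilePath): array
{
    if (!is_readable($localFilePath)) {
        throw new RuntimeException("Cannot read file: $localFilePath");
    }
    $bytes = file_get_contents($localFilePath);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read file: $localFilePath");
    }

    return [
        // Raw bytes; the SDK takes care of encoding the blob for the request.
        'Document' => ['Bytes' => $bytes],
        // Only LAYOUT is requested here; add TABLES or FORMS if they are needed.
        'FeatureTypes' => ['LAYOUT'],
    ];
}
```

The returned array would be passed as `$textractClient->analyzeDocument(buildLayoutRequest('./13_article.pdf'))`, where `$textractClient` is an `Aws\Textract\TextractClient`.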
UPDATE:
Using ChatGPT, I was at least able to cobble together PHP code that extracts text from a PDF:
<?php
require '../../vendor/autoload.php';
use Aws\S3\S3Client;
use Aws\Textract\TextractClient;
use Aws\Exception\AwsException;
// Create an S3 client
$s3Client = new S3Client([
    'region' => 'us-west-2',
    'version' => 'latest',
    'credentials' => [
        'key' => getenv('AWS_ACCESS_KEY_ID'),
        'secret' => getenv('AWS_SECRET_ACCESS_KEY')
    ]
]);
// Create a Textract client
$textractClient = new TextractClient([
    'region' => 'us-west-2',
    'version' => '2018-06-27',
    'credentials' => [
        'key' => getenv('AWS_ACCESS_KEY_ID'),
        'secret' => getenv('AWS_SECRET_ACCESS_KEY')
    ]
]);
// Specify the path to the local file
$localFilePath = './13_article.pdf';
// Specify the S3 bucket and object key where you want to upload the file
$bucketName = 'docs.scbbs.com';
$objectKey = 'docs/test/13_article.pdf';
// Upload the local file to S3
$result = $s3Client->putObject([
    'Bucket' => $bucketName,
    'Key' => $objectKey,
    'SourceFile' => $localFilePath
]);
if ($result['@metadata']['statusCode'] == 200) {
    echo "File uploaded to S3\n";
} else {
    echo "Failed to upload file to S3\n";
    exit;
}
// Prepare the request for StartDocumentAnalysis
$options = [
    'DocumentLocation' => [
        'S3Object' => [
            'Bucket' => $bucketName,
            'Name' => $objectKey
        ]
    ],
    'FeatureTypes' => ['TABLES', 'FORMS', 'LAYOUT'] // Include 'LAYOUT' in the feature types
];
try {
    // Call the StartDocumentAnalysis API
    $result = $textractClient->startDocumentAnalysis($options);
    echo "Document analysis started\n";
    $jobId = $result['JobId'];
    echo "Job Id: $jobId\n";
    // Poll for the job status
    do {
        sleep(30); // wait for 30 seconds before checking the status again
        $result = $textractClient->getDocumentAnalysis(['JobId' => $jobId]);
        $jobStatus = $result['JobStatus'];
        echo "Job status: $jobStatus\n";
    } while ($jobStatus == 'IN_PROGRESS');
    if ($jobStatus == 'SUCCEEDED') {
        // Fetch all pages of the result
        $nextToken = null;
        do {
            $params = [
                'JobId' => $jobId
            ];
            if ($nextToken) {
                $params['NextToken'] = $nextToken;
            }
            $response = $textractClient->getDocumentAnalysis($params);
            foreach ($response['Blocks'] as $block) {
                if ($block['BlockType'] === 'LINE') {
                    echo $block['Text'] . "\n";
                }
            }
            $nextToken = $response['NextToken'] ?? null;
        } while ($nextToken);
    }
} catch (AwsException $e) {
    // Handle AWS exceptions
    echo "AWS Error: " . $e->getAwsErrorMessage() . "\n";
} catch (\Exception $e) {
    // Handle general exceptions
    echo "General Error: " . $e->getMessage() . "\n";
}
?>
This is the source pdf: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_Article_11_1.pdf
This is the text that is output using above code: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/13_article-output+using+04.txt
The code requests the LAYOUT feature but never uses the layout blocks in the response (it prints every LINE block), so the output text still contains page numbers, page headers and page footers.
Can someone explain how I can implement the Layout feature with this code to get the desired output?
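One step that any layout-based filtering would probably need first: layout blocks point to their lines by Id, so it seems safer to gather every Block from all result pages before resolving anything, rather than printing inside the pagination loop. A minimal sketch, assuming a hypothetical helper `collectAllBlocks()` that takes a callable returning one `getDocumentAnalysis` response per `NextToken`:

```php
<?php
// Sketch only: collectAllBlocks() is a hypothetical helper, not an SDK function.
// $fetchPage receives a NextToken (or null for the first page) and returns one
// response (an array or an Aws\Result, both support ['Blocks'] access).
function collectAllBlocks(callable $fetchPage): array
{
    $blocks = [];
    $nextToken = null;
    do {
        $response = $fetchPage($nextToken);
        foreach ($response['Blocks'] ?? [] as $block) {
            $blocks[] = $block;
        }
        $nextToken = $response['NextToken'] ?? null;
    } while ($nextToken !== null);

    return $blocks;
}

// Possible usage with the job from the code above (left commented out here):
// $blocks = collectAllBlocks(function (?string $token) use ($textractClient, $jobId) {
//     $params = ['JobId' => $jobId];
//     if ($token !== null) {
//         $params['NextToken'] = $token;
//     }
//     return $textractClient->getDocumentAnalysis($params);
// });
```

Passing the fetcher as a callable keeps the loop independent of the SDK, so it can be tried against canned responses before running a real job.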
Thank you for the response. The problem is that it doesn't answer the question: how do I do this using the PHP SDK? I am selecting the feature type recommended in the article:
'FeatureTypes' => ['TABLES', 'FORMS', 'LAYOUT'] // Include 'LAYOUT' in the feature types
But I see nothing that explains how I retrieve just the lines I want, excluding headers and footers and page numbers.
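A minimal sketch of what such filtering might look like, assuming the response contains blocks whose `BlockType` starts with `LAYOUT_` (for example `LAYOUT_TEXT`, `LAYOUT_TITLE`, `LAYOUT_HEADER`, `LAYOUT_FOOTER`, `LAYOUT_PAGE_NUMBER`) and that each one lists its LINE blocks through a `CHILD` relationship. The function name `layoutTextWithoutFurniture()` is hypothetical. It also assumes that `$blocks` holds every Block of the job and that layout blocks arrive in reading order, which I have not verified for all documents.

```php
<?php
// Sketch only: hypothetical helper, not part of the AWS SDK.
// Layout block types whose text should be dropped from the output.
const EXCLUDED_LAYOUT_TYPES = ['LAYOUT_HEADER', 'LAYOUT_FOOTER', 'LAYOUT_PAGE_NUMBER'];

function layoutTextWithoutFurniture(array $blocks): string
{
    // Index blocks by Id so CHILD relationships can be resolved.
    $byId = [];
    foreach ($blocks as $block) {
        $byId[$block['Id']] = $block;
    }

    $paragraphs = [];
    foreach ($blocks as $block) {
        $type = $block['BlockType'] ?? '';
        // Keep only layout blocks that are not headers, footers or page numbers.
        if (strpos($type, 'LAYOUT_') !== 0 || in_array($type, EXCLUDED_LAYOUT_TYPES, true)) {
            continue;
        }
        $lines = [];
        foreach ($block['Relationships'] ?? [] as $relationship) {
            if (($relationship['Type'] ?? '') !== 'CHILD') {
                continue;
            }
            foreach ($relationship['Ids'] ?? [] as $childId) {
                $child = $byId[$childId] ?? null;
                // Only direct LINE children are used. Container blocks such as
                // LAYOUT_LIST (assumed to hold LAYOUT_TEXT children) add nothing
                // here; their children are handled as layout blocks of their own.
                if ($child !== null && ($child['BlockType'] ?? '') === 'LINE') {
                    $lines[] = $child['Text'];
                }
            }
        }
        if ($lines !== []) {
            // Lines of one layout block are joined into a single paragraph.
            $paragraphs[] = implode(' ', $lines);
        }
    }

    // Blank line between layout blocks.
    return implode("\n\n", $paragraphs);
}
```

Joining lines with a space and separating blocks with a blank line is just one choice for the linearized output; the separators are easy to change if the original line breaks should be kept.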