Amazon Textract's layout analysis feature uses ML model(s) to identify which detected text elements are headings, and will likely perform better than the rule-based heuristics (upper-case or ending-with-colon) suggested by the re:Post Agent answer.
To use it, you would:
- Analyze the document with FeatureTypes=['LAYOUT'] enabled, using either the synchronous analyze_document or the asynchronous start_document_analysis, depending on your input file type. Async is generally recommended for scalability.
- Check through the LAYOUT_SECTION_HEADER and LAYOUT_TITLE detections returned in the response, which link back to the underlying LINEs of detected text. From these you can extract whatever information you need (e.g. the heading text itself; its position and page number; the content of the document between two headings; and so on).
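As a rough illustration of the second step, here is a minimal sketch that walks the raw Blocks list and joins the child LINE text of each heading. The function name extract_layout_headings and the shape of the returned dictionaries are my own choices for illustration, and it assumes you have already collected every block from the response (all pages) into a single list:

```python
# Layout block types that Textract uses for headings.
HEADING_TYPES = {"LAYOUT_TITLE", "LAYOUT_SECTION_HEADER"}


def extract_layout_headings(blocks):
    """Return heading type, text and page for each layout heading block.

    `blocks` is assumed to be the full list of Block dicts from a
    Textract response produced with FeatureTypes=['LAYOUT'].
    """
    by_id = {block["Id"]: block for block in blocks}
    headings = []
    for block in blocks:
        if block.get("BlockType") not in HEADING_TYPES:
            continue
        # CHILD relationships point at the LINE blocks inside the heading.
        child_ids = [
            child_id
            for rel in block.get("Relationships") or []
            if rel.get("Type") == "CHILD"
            for child_id in rel.get("Ids", [])
        ]
        lines = [
            by_id[child_id].get("Text", "")
            for child_id in child_ids
            if child_id in by_id and by_id[child_id].get("BlockType") == "LINE"
        ]
        headings.append(
            {
                "type": block["BlockType"],
                "text": " ".join(lines),
                "page": block.get("Page"),  # may be absent for single-page input
            }
        )
    return headings
```

You would call it as extract_layout_headings(response['Blocks']) for a single-page response, or on the combined blocks of all result pages for an async job.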
Amazon Textract API responses include a lot of information, so code to navigate their structure can be tricky to build from scratch. If you're using Python, I'd strongly recommend checking out the Amazon Textract Textractor package, which can parse the response into an easier-to-query native Python object with utility properties and methods. Similarly, for JavaScript/TypeScript, there is TRP.js.
It's worth noting that the Textract layout feature costs more than text detection alone (OCR), so if all your documents follow a simple and standardized structure, a heuristic approach as mentioned in the other answer might still make sense. Even with heuristics, though, you might find it easier to use Textractor in your Python code than to loop through the raw response JSON.
To extract section headers from a multi-page PDF using Amazon Textract, you can follow these steps:
- Set up the necessary AWS services, including Amazon S3 to store your PDF file, and configure your AWS credentials.
- Use the Amazon Textract asynchronous API (Async API), which is designed for processing multi-page documents like PDFs.
- Here's a basic outline of the Python code you would use:
    import time

    import boto3


    def start_document_analysis(bucket, document):
        """Start an asynchronous analysis job and return its JobId."""
        textract = boto3.client('textract')
        response = textract.start_document_analysis(
            DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': document}},
            FeatureTypes=["TABLES", "FORMS"]
        )
        return response['JobId']


    def get_document_analysis(job_id):
        """Poll every 5 seconds until the job is no longer IN_PROGRESS."""
        textract = boto3.client('textract')
        response = textract.get_document_analysis(JobId=job_id)
        status = response['JobStatus']
        while status == 'IN_PROGRESS':
            time.sleep(5)
            response = textract.get_document_analysis(JobId=job_id)
            status = response['JobStatus']
        # Only the first page of results is returned here (see NextToken).
        return response


    def extract_section_headers(response):
        """Collect LINE blocks that look like headers under simple rules."""
        section_headers = []
        for block in response['Blocks']:
            if block['BlockType'] == 'LINE':
                # You may need to adjust this logic based on your PDF structure
                if block['Text'].isupper() or block['Text'].endswith(':'):
                    section_headers.append(block['Text'])
        return section_headers


    # Usage
    bucket = 'your-s3-bucket-name'
    document = 'your-pdf-file-name.pdf'
    job_id = start_document_analysis(bucket, document)
    result = get_document_analysis(job_id)
    headers = extract_section_headers(result)
    print(headers)
This code starts an asynchronous document analysis job, waits for it to complete, and then extracts potential section headers based on some basic rules (all uppercase text or lines ending with a colon). You may need to refine the extract_section_headers function based on the specific structure of your PDF.
Remember to handle pagination if your PDF has many pages, as the get_document_analysis function may need to be called multiple times with a NextToken to retrieve all results.
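A minimal sketch of that pagination loop follows. The helper name get_all_blocks is hypothetical, and it assumes the job has already finished successfully and that you pass in a Textract client such as boto3.client('textract'):

```python
def get_all_blocks(textract, job_id):
    """Collect Blocks from every result page of a finished analysis job.

    `textract` is assumed to be a boto3 Textract client (or any object
    with a compatible get_document_analysis method).
    """
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            # Ask for the page that follows the one we just read.
            kwargs["NextToken"] = next_token
        response = textract.get_document_analysis(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            return blocks
```

The combined list can then be passed to the header extraction logic in place of a single response's 'Blocks'.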
Also, ensure you have the necessary IAM permissions to use Amazon Textract and access your S3 bucket. You may need to adjust the code to handle errors and edge cases in a production environment.
Sources
Detecting or Analyzing Text in a Multipage Document - Amazon Textract
Automatically extract content from PDF files using Amazon Textract - AWS Prescriptive Guidance
