
Extract section headers from a multi-page PDF

0

Hi, I need help with getting the Python code for extracting section_headers from a multi-page PDF.

asked 10 months ago, 448 views
2 Answers
0

Amazon Textract's layout analysis feature uses ML model(s) to identify which detected text elements are headings, and will likely perform better than the rule-based heuristics (upper-case or ending-with-colon) suggested by the re:Post Agent answer.

To use it, you would:

  1. Analyze the document with FeatureTypes=['LAYOUT'] enabled, using either the synchronous analyze_document or the asynchronous start_document_analysis depending on your input file type. Async is generally recommended for scalability, and the synchronous API only accepts single-page documents, so a multi-page PDF needs the async route.
  2. Check through the LAYOUT_SECTION_HEADER and LAYOUT_TITLE detections returned in the response. These link back to the underlying LINEs of detected text, from which you can extract whatever information you need (for example, the heading text itself, its position and page number, or the content of the document between two headings).
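
As a rough illustration of step 2, here is a minimal sketch of walking the raw Blocks list yourself. It assumes you have already collected all Blocks from the analysis response into one list, and that each layout block points to its LINE blocks through a CHILD relationship. The function name extract_layout_headers and the sample data are made up for this example.

```python
HEADING_TYPES = {"LAYOUT_TITLE", "LAYOUT_SECTION_HEADER"}

def extract_layout_headers(blocks):
    """Return heading type, page and text for each layout heading block."""
    by_id = {block["Id"]: block for block in blocks}
    headers = []
    for block in blocks:
        if block["BlockType"] not in HEADING_TYPES:
            continue
        # Collect the Ids of child blocks (Relationships may be absent)
        child_ids = [
            child_id
            for rel in (block.get("Relationships") or [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        # Join the text of the child LINE blocks in the order given
        lines = [
            by_id[cid]["Text"]
            for cid in child_ids
            if cid in by_id and by_id[cid]["BlockType"] == "LINE"
        ]
        headers.append({
            "type": block["BlockType"],
            "page": block.get("Page"),
            "text": " ".join(lines),
        })
    return headers

# Hand-written sample data shaped like Textract blocks (not real output)
sample_blocks = [
    {"Id": "h1", "BlockType": "LAYOUT_SECTION_HEADER", "Page": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["l1", "l2"]}]},
    {"Id": "l1", "BlockType": "LINE", "Page": 2, "Text": "3. Results"},
    {"Id": "l2", "BlockType": "LINE", "Page": 2, "Text": "and Discussion"},
    {"Id": "t1", "BlockType": "LAYOUT_TEXT", "Page": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["l3"]}]},
    {"Id": "l3", "BlockType": "LINE", "Page": 2, "Text": "Body text."},
]
print(extract_layout_headers(sample_blocks))
```

In real use you would pass in the Blocks gathered from get_document_analysis rather than the sample list.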

Amazon Textract API responses include a lot of information, so code to navigate their structure can be tricky to build from scratch. If you're using Python, I'd strongly recommend checking out the Amazon Textract Textractor package, which can parse the response into an easier-to-query native Python object with utility properties and methods. Similarly, for JavaScript/TypeScript there is TRP.js.


It's worth noting that the Textract layout feature costs more than text detection (OCR) alone, so if all your documents follow a simple, standardized structure, a heuristic approach like the one in the other answer might still make sense. Even with heuristics, though, you might find it easier to use Textractor in your Python code than to loop through the raw response JSON.

AWS
EXPERT
answered 10 months ago
-1

To extract section headers from a multi-page PDF using Amazon Textract, you can follow these steps:

  1. Set up the necessary AWS services, including Amazon S3 to store your PDF file, and configure your AWS credentials.

  2. Use the Amazon Textract asynchronous API, which is designed for processing multi-page documents such as PDFs.

  3. Here's a basic outline of the Python code you would use:

import boto3
import time

def start_document_analysis(bucket, document):
    textract = boto3.client('textract')
    response = textract.start_document_analysis(
        DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': document}},
        FeatureTypes=["TABLES", "FORMS"]
    )
    return response['JobId']

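def get_document_analysis(job_id):
    textract = boto3.client('textract')
    # Poll until the job leaves the IN_PROGRESS state
    response = textract.get_document_analysis(JobId=job_id)
    while response['JobStatus'] == 'IN_PROGRESS':
        time.sleep(5)
        response = textract.get_document_analysis(JobId=job_id)
    # Fail loudly rather than returning a response without usable blocks
    if response['JobStatus'] == 'FAILED':
        raise RuntimeError(
            f"Textract job {job_id} failed: {response.get('StatusMessage')}"
        )
    return response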

def extract_section_headers(response):
    section_headers = []
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            # You may need to adjust this logic based on your PDF structure
            if block['Text'].isupper() or block['Text'].endswith(':'):
                section_headers.append(block['Text'])
    return section_headers

# Usage
bucket = 'your-s3-bucket-name'
document = 'your-pdf-file-name.pdf'

job_id = start_document_analysis(bucket, document)
result = get_document_analysis(job_id)
headers = extract_section_headers(result)
print(headers)

This code starts an asynchronous document analysis job, waits for it to complete, and then extracts potential section headers based on some basic rules (all uppercase text or lines ending with a colon). You may need to refine the extract_section_headers function based on the specific structure of your PDF.
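
One possible refinement, as a sketch only: skip long lines (headers are usually short) and keep the page number alongside the text. The 10-word cutoff and the names looks_like_header and extract_candidate_headers are assumptions for illustration, so tune them against your own documents.

```python
def looks_like_header(text, max_words=10):
    """Heuristic: short line that is all upper-case or ends with a colon."""
    text = text.strip()
    if not text or len(text.split()) > max_words:
        return False
    return text.isupper() or text.endswith(":")

def extract_candidate_headers(blocks):
    """Return (page, text) pairs for LINE blocks that pass the heuristic."""
    return [
        (block.get("Page"), block["Text"].strip())
        for block in blocks
        if block["BlockType"] == "LINE" and looks_like_header(block["Text"])
    ]
```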

Remember to handle pagination if your PDF has many pages, as the get_document_analysis function may need to be called multiple times with a NextToken to retrieve all results.
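
A minimal sketch of that pagination loop, assuming the job has already finished. The helper name get_all_blocks is made up for this example, and the client is passed in as a parameter so the function can be exercised without AWS access.

```python
def get_all_blocks(textract, job_id):
    """Collect Blocks from every page of results for a finished job."""
    blocks = []
    kwargs = {"JobId": job_id}
    while True:
        page = textract.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            # No NextToken means this was the last page of results
            return blocks
        kwargs["NextToken"] = token

# Usage (needs AWS credentials and a completed job):
#   import boto3
#   all_blocks = get_all_blocks(boto3.client("textract"), job_id)
#   headers = extract_section_headers({"Blocks": all_blocks})
```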

Also, ensure you have the necessary IAM permissions to use Amazon Textract and access your S3 bucket. You may need to adjust the code to handle errors and edge cases in a production environment.
Sources
Detecting or Analyzing Text in a Multipage Document - Amazon Textract
Automatically extract content from PDF files using Amazon Textract - AWS Prescriptive Guidance

answered 10 months ago
