How are documents processed to extract text for knowledge bases?


Hi,

I've reviewed the Amazon Bedrock Knowledge Base documentation but couldn't find details on how text is extracted from documents during preprocessing. The "How it Works" section (How it Works - Amazon Bedrock) leaves this step abstract.

How are documents processed to extract text for knowledge bases?

Thanks!

3 Answers

If you're referring to the source document file, it must be in one of the following supported formats:

| Format | Extension |
| --- | --- |
| Plain text | .txt |
| Markdown | .md |
| HyperText Markup Language | .html |
| Microsoft Word document | .doc/.docx |
| Comma-separated values | .csv |
| Microsoft Excel spreadsheet | .xls/.xlsx |
| Portable Document Format | .pdf |

⚡ You can find this information in the guide on setting up a data source for your knowledge base: Set up a data source for your knowledge base.
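If you stage documents in Amazon S3 yourself, a quick pre-check against this list can catch unsupported files before an ingestion job fails on them. A minimal sketch in Python (the directory layout and helper name are illustrative, not from the AWS docs):

```python
# Sketch: pre-filter local files against Bedrock's supported extensions
# before syncing them to the S3 bucket backing a data source.
from pathlib import Path

SUPPORTED_EXTENSIONS = {
    ".txt", ".md", ".html", ".doc", ".docx", ".csv", ".xls", ".xlsx", ".pdf",
}

def partition_files(root: str) -> tuple[list[Path], list[Path]]:
    """Split files under root into (ingestable, skipped) by extension."""
    ingestable, skipped = [], []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        (ingestable if path.suffix.lower() in SUPPORTED_EXTENSIONS else skipped).append(path)
    return ingestable, skipped

ok, bad = partition_files("./docs")  # "./docs" is a placeholder path
print(f"{len(ok)} files can be ingested; {len(bad)} unsupported files skipped")
```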


💡 If this answer doesn't meet your expectations, could you please clarify your question so I can better address your concerns?

EXPERT · answered 17 days ago
AWS EXPERT · reviewed 17 days ago
  • Hi,

    Thanks for the response. I did check out the guide on setting up a data source for the knowledge base, but I’m specifically interested in delving deeper into the process of text extraction from documents.

    To clarify, while I understand the general flow of actions involved in setting up a knowledge base, what I’m particularly keen on is understanding the intricacies of the "Text Extraction From documents" phase. Our documents often contain a lot of what we call "dirty" text in PDFs, and we've struggled to extract clear text from them. It seems like Amazon has been able to handle this effectively, and I'm curious if there are insights or techniques we could learn from your approach.

    Essentially, I’m wondering if there are any specific methods or services Amazon employs to achieve high-quality text extraction from PDFs, and whether these are available for use or integration into our own processes.

    Thanks for your help!


Hi there,

It works by converting the text into a numerical representation that LLMs can understand.

Imagine you have a bunch of documents, like essays, articles, or reports, that you want to store in a database. The goal is to make it easy to find specific information in these documents when you need it.

The first step is to split the documents into smaller pieces, called "chunks." This makes it easier to search through the information later on. Think of it like breaking a big book into smaller chapters or sections.
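To make that concrete, a toy version of fixed-size chunking with overlap might look like the sketch below (Bedrock lets you configure chunking per data source; the sizes here are made up for illustration, not Bedrock defaults):

```python
# A toy illustration of fixed-size chunking with word overlap.
# Chunk size and overlap are made-up values, not Bedrock defaults.
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = chunk_text(open("report.txt").read())  # "report.txt" is a placeholder
print(f"{len(chunks)} chunks")
```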

Next, these chunks are converted into a special kind of code called "embeddings." Embeddings are like a way to represent the meaning of the text in a mathematical form that a computer can understand. This helps the computer figure out how similar the chunks are to each other, or to a question you might ask.
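For example, generating an embedding for one chunk through the Bedrock runtime API might look like this (the model ID and region are assumptions; check which embedding models are enabled in your account):

```python
# A minimal sketch of embedding one chunk with Amazon Bedrock.
# Model ID and region are assumptions; verify availability in your account.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list[float]:
    """Return the embedding vector for text using a Titan embedding model."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

vector = embed("a chunk of document text")
print(len(vector))  # Titan Embeddings G1 - Text returns 1536 dimensions
```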

The embeddings are then stored in a "vector index." This is like a special type of database that's optimized for quickly finding the most relevant chunks based on the embeddings. It keeps track of where each chunk came from, so you can go back to the original document if you need to.

Finally, when you have a question or search term, the computer can use the vector index to find the chunks that are most similar to your query. This helps you quickly find the information you're looking for, without having to read through the entire collection of documents.
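Put together, a toy in-memory version of the store-and-search step might look like this (a real knowledge base delegates this to a managed vector store such as Amazon OpenSearch Serverless; this sketch only shows the mechanics of cosine-similarity retrieval):

```python
# A toy in-memory "vector index": store (chunk, embedding) pairs and
# return the chunks most similar to a query embedding by cosine similarity.
import numpy as np

class ToyVectorIndex:
    def __init__(self) -> None:
        self.chunks: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, chunk: str, embedding: list[float]) -> None:
        self.chunks.append(chunk)
        self.vectors.append(np.asarray(embedding, dtype=np.float32))

    def search(self, query_embedding: list[float], k: int = 3) -> list[str]:
        q = np.asarray(query_embedding, dtype=np.float32)
        mat = np.stack(self.vectors)
        # Cosine similarity between the query and every stored vector.
        sims = (mat @ q) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
        top = np.argsort(sims)[::-1][:k]
        return [self.chunks[i] for i in top]
```

In a real deployment the embeddings for both the chunks and the query come from the same embedding model, which is what makes the similarity scores meaningful.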

The image shows how this whole process works, from splitting the documents into chunks, to creating the embeddings, to storing them in the vector index. It's a handy way to make sure you can find the information you need, even in a big collection of documents.

[Image: ingestion flow from document chunks to embeddings to the vector index]

Matt-B · AWS EXPERT
answered 16 days ago


Moving your follow-up to a new answer:

    Thank you for your detailed explanation. And I am sorry for not clarifying myself in the original post. I did check out the guide on setting up a data source for the knowledge base, but I’m specifically interested in delving deeper into the process of text extraction from documents.

    To clarify, while I understand the general flow of actions involved in setting up a knowledge base, what I’m particularly keen on is understanding the intricacies of the "Text Extraction From documents" phase. Our documents often contain a lot of what we call "dirty" text in PDFs, and we've struggled to extract clear text from them. It seems like Amazon has been able to handle this effectively, and I'm curious if there are insights or techniques we could learn from your approach.

    Essentially, I’m wondering if there are any specific methods or services Amazon employs to achieve high-quality text extraction from PDFs, and whether these are available for use or integration into our own processes.

It seems like you are asking how Amazon Q extracts text from the PDF. Q converts your PDF to HTML and then extracts the text. See https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/doc-types.html#doc-types-supported

Can you clarify what you mean by "dirty" text? If you are referring to handwriting or scanned PDF files, that isn't supported. If you are struggling with scanned PDFs or handwriting, you might want to run the documents through Amazon Textract first, and then ingest the raw text output into Q. Take a look at https://aws.amazon.com/blogs/machine-learning/process-text-and-images-in-pdf-documents-with-amazon-textract/
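For a rough idea of what that preprocessing step looks like, here is a sketch of Textract's asynchronous text detection on a PDF stored in S3 (bucket and key are placeholders; production code should use the SNS job-completion notification rather than polling):

```python
# Sketch: extract raw text from a scanned PDF in S3 with Amazon Textract.
# Bucket/key are placeholders; polling is used here only for brevity.
import time
import boto3

textract = boto3.client("textract")

job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "scan.pdf"}}
)
job_id = job["JobId"]

# Wait for the asynchronous job to finish.
while True:
    result = textract.get_document_text_detection(JobId=job_id)
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Gather LINE blocks across all pages of results.
lines = []
while True:
    lines += [b["Text"] for b in result.get("Blocks", []) if b["BlockType"] == "LINE"]
    token = result.get("NextToken")
    if not token:
        break
    result = textract.get_document_text_detection(JobId=job_id, NextToken=token)

print("\n".join(lines))
```

The plain-text output can then be written back to S3 and ingested in place of the original scanned PDF.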

Matt-B · AWS EXPERT
answered 16 days ago
