Textract - Why do Blocks from subsequent GetDocumentAnalysisAsync(getResultsRequest) calls have no relationships populated?

Hi and thank you for any help you can provide...

I am using Textract with the .NET SDK.

I am able to successfully submit a TABLE analysis job and get its results, and now I am trying to loop through those results. I make my first call to GetDocumentAnalysisAsync, then loop through the blocks of type CELL to get the row and column index, and then look through each cell's relationships to get the text.

This all works fine up to the point where I need to get the next blocks with a call to GetDocumentAnalysisAsync, passing the NextToken. I then get the next set of CELL blocks and begin looping again. This time, none of my CELL blocks have any relationships I can use to get the text. I just get what looks like a lot of empty cells. I have verified the page is readable through the console demo page, so there should be text there.

Here is the code (in C#) I am using to iterate through the blocks and retrieve the next set of blocks.


if (getResultsResponse.JobStatus == JobStatus.SUCCEEDED)
{
    do
    {
        getResultsResponse.Blocks.ForEach(x => {
            if (x.BlockType.Equals("CELL"))
            {
                Console.WriteLine("Page: " + x.Page.ToString());
                Console.WriteLine("Rowindex: " + x.RowIndex.ToString());
                Console.WriteLine("Colindex: " + x.ColumnIndex.ToString());
                x.Relationships.ForEach(y =>
                {
                    y.Ids.ForEach(z =>
                    {
                        var cellText = (from text in getResultsResponse.Blocks where text.Id == z.ToString() select text.Text).FirstOrDefault();
                        if (!string.IsNullOrEmpty(cellText))
                        {
                            Console.Write($"{cellText} ");
                        }
                    });
                });
            }
        });

        if (string.IsNullOrEmpty(getResultsResponse.NextToken)) { break; }

        getResultsRequest.NextToken = getResultsResponse.NextToken;
        getResultsResponse = await _textractClient.GetDocumentAnalysisAsync(getResultsRequest);
    } while (getResultsResponse.Blocks.Count > 0);
}

Asked 2 years ago · 292 views
3 answers

Hi,

Judging by your code, you replace getResultsResponse on every loop iteration with the subset returned for the NextToken. The blocks in a full response are usually ordered WORD, then LINE, then CELL, so on the first iteration getResultsResponse contains WORD, LINE and CELL blocks, which is why you can find the child elements. When you fetch the next set of blocks, however, you may get only CELL blocks, so the children they reference are not in that subset.

For example, if the full response is [ PAGE_1, WORD_1, WORD_2, TABLE_1, CELL_1, CELL_2 ] and each response returns 5 blocks before moving on to the next token, then getResultsResponse will look like this:

  • 1st iteration: getResultsResponse = [ PAGE_1, WORD_1, WORD_2, TABLE_1, CELL_1 ], so you will find WORD_1, the child of CELL_1.
  • 2nd iteration: getResultsResponse = [ CELL_2 ], so WORD_2, the child of CELL_2, is not part of this list. It was returned in the first iteration, which is why you cannot find it.

A fix would be to first fetch every part of the response and concatenate the blocks into a single list. Then you can run your script to look up the children within this concatenated list.
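
The example above can be reproduced without calling Textract at all. The following is a minimal Python sketch of that toy case only: the block IDs come from the example, while the CHILDREN mapping, PAGE_SIZE and the helper names are assumptions made for illustration, not part of any SDK.

```python
# Toy model of the example above: block IDs only, no real Textract call.
# CHILDREN maps each CELL to the WORD it references (assumed for illustration).
ALL_BLOCKS = ["PAGE_1", "WORD_1", "WORD_2", "TABLE_1", "CELL_1", "CELL_2"]
CHILDREN = {"CELL_1": ["WORD_1"], "CELL_2": ["WORD_2"]}
PAGE_SIZE = 5  # hypothetical number of blocks returned per response


def paginate(blocks, size):
    """Split the full block list into response-sized chunks."""
    return [blocks[i:i + size] for i in range(0, len(blocks), size)]


def resolve(cells_from, lookup):
    """Return {cell: [children found in lookup]} for every CELL in cells_from."""
    known = set(lookup)
    return {
        cell: [child for child in CHILDREN[cell] if child in known]
        for cell in cells_from
        if cell.startswith("CELL")
    }


pages = paginate(ALL_BLOCKS, PAGE_SIZE)

# Per-response lookup (the approach in the question): each page only sees itself.
per_page = {}
for page in pages:
    per_page.update(resolve(page, page))

# Merged lookup (the suggested fix): concatenate first, then resolve.
merged_blocks = [block for page in pages for block in page]
merged = resolve(merged_blocks, merged_blocks)

print(per_page)  # CELL_2 finds no children within its own page
print(merged)    # both cells find their WORD in the merged list
```

In the per-response version CELL_2 ends up with an empty child list, which matches the "empty cells" symptom described in the question.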

Hope this helps.

AWS
Answered 2 years ago

To add to the above response, please review the implementation of the get_full_json function in Python at the link below. It implements the recommended fix.
https://github.com/aws-samples/amazon-textract-textractor/blob/master/caller/textractcaller/t_call.py
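
For readers who only want the shape of the idea, here is a short Python sketch of collecting every page before resolving children. It is not the code from the linked repository: get_all_blocks and cell_text are hypothetical helper names, and the client argument is assumed to behave like a boto3 Textract client, whose get_document_analysis call accepts JobId and NextToken.

```python
def get_all_blocks(client, job_id):
    """Collect the blocks from every page of results, keyed by block Id.

    `client` is assumed to behave like a boto3 Textract client, i.e. it
    exposes get_document_analysis(JobId=..., NextToken=...).
    """
    blocks_by_id = {}
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)
        for block in response.get("Blocks", []):
            blocks_by_id[block["Id"]] = block
        # No NextToken in the response means this was the last page.
        next_token = response.get("NextToken")
        if not next_token:
            return blocks_by_id


def cell_text(cell, blocks_by_id):
    """Join the text of a CELL's CHILD blocks, looked up across all pages."""
    words = []
    for relationship in cell.get("Relationships") or []:
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id.get(child_id)
            if child and child.get("Text"):
                words.append(child["Text"])
    return " ".join(words)
```

Keying the blocks by Id makes each child lookup a dictionary access rather than a scan of the whole list, which matters once a document has thousands of blocks.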

AWS
keithm
Answered 2 years ago

Collecting all the blocks up front in a Dictionary worked! Thank you!

Here is the modified code...

// Collect every block from every page of results, keyed by block Id.
Dictionary<string, Block> blocks = new Dictionary<string, Block>();

if (getResultsResponse.JobStatus == JobStatus.SUCCEEDED)
{
    do
    {
        getResultsResponse.Blocks.ForEach(x => {
            blocks.Add(x.Id, x);
        });

        if (string.IsNullOrEmpty(getResultsResponse.NextToken)) { break; }

        getResultsRequest.NextToken = getResultsResponse.NextToken;
        getResultsResponse = _textractClient.GetDocumentAnalysis(getResultsRequest);
    } while (getResultsResponse.Blocks.Count > 0);

    // Resolve each CELL's child Ids against the full set of blocks,
    // not just the blocks from the most recent response.
    foreach (KeyValuePair<string, Block> entry in blocks)
    {
        if (entry.Value.BlockType.Equals("CELL"))
        {
            Console.WriteLine("Page: " + entry.Value.Page.ToString());
            Console.WriteLine("Rowindex: " + entry.Value.RowIndex.ToString());
            Console.WriteLine("Colindex: " + entry.Value.ColumnIndex.ToString());
            entry.Value.Relationships.ForEach(y =>
            {
                y.Ids.ForEach(z =>
                {
                    if (blocks.ContainsKey(z.ToString()))
                    {
                        var cellText = blocks[z.ToString()].Text;
                        if (!string.IsNullOrEmpty(cellText))
                        {
                            Console.Write($"{cellText} ");
                        }
                        else
                        {
                            Console.Write("EMPTY ");
                        }
                    }
                });
            });
        }
    }
}


Answered 2 years ago
