Textract - Why do Blocks from subsequent GetDocumentAnalysisAsync(getResultsRequest) calls have no relationships populated?

0

Hi and thank you for any help you can provide...

I am using Textract with the .NET SDK.

I am able to successfully submit a TABLE analysis job and get its results, and now I am trying to loop through them. I make my first call to GetDocumentAnalysisAsync, then loop through the blocks of type CELL to get the row and column index, and then look through each cell's relationships to get the text.

This all works fine until I need to get the next set of blocks with another call to GetDocumentAnalysisAsync, passing the NextToken. I then get the next set of CELL blocks and begin looping again. This time, none of my CELL blocks have any relationships populated to get the text from; I just get what looks like a lot of empty cells. I have verified through the console demo page that the page is readable, so there should be text there.

Here is the code (in C#) I am using to iterate through the blocks and retrieve the next set of blocks.


            if (getResultsResponse.JobStatus == JobStatus.SUCCEEDED)
            {
                do
                {
                    getResultsResponse.Blocks.ForEach(x => {
                        if (x.BlockType.Equals("CELL"))
                        {
                            Console.WriteLine("Page: " + x.Page.ToString());
                            Console.WriteLine("Rowindex: " + x.RowIndex.ToString());
                            Console.WriteLine("Colindex: " + x.ColumnIndex.ToString());
                            x.Relationships.ForEach(y =>
                            {
                                y.Ids.ForEach(z =>
                                {
                                    var cellText = (from text in getResultsResponse.Blocks where text.Id == z.ToString() select text.Text).FirstOrDefault();
                                    if (!string.IsNullOrEmpty(cellText))
                                    {
                                        Console.Write($"{cellText} ");
                                    }
                                });
                            });
                        }
                    });

                    if (string.IsNullOrEmpty(getResultsResponse.NextToken)) { break; }

                    getResultsRequest.NextToken = getResultsResponse.NextToken;
                    getResultsResponse = await _textractClient.GetDocumentAnalysisAsync(getResultsRequest);
                } while (getResultsResponse.Blocks.Count > 0);
            }

asked 2 years ago, 285 views
3 Answers
1

Hi,

According to your code, you are replacing getResultsResponse on every loop iteration with the subset returned for the NextToken. The order of blocks in a full response is usually WORD, then LINE, then CELL, so on the first iteration getResultsResponse contains WORD, LINE, and CELL blocks, which is why you can find the child elements. When you fetch the next set of blocks, however, you may get only CELL blocks, so the children they reference are not in that subset.

For example, if the full response is [ PAGE_1, WORD_1, WORD_2, TABLE_1, CELL_1, CELL_2 ] and each response holds at most 5 blocks before a NextToken is issued, getResultsResponse will look like this:

  • 1st iteration: getResultsResponse = [ PAGE_1, WORD_1, WORD_2, TABLE_1, CELL_1 ], so you will find WORD_1, the child of CELL_1.
  • 2nd iteration: getResultsResponse = [ CELL_2 ], so WORD_2, the child of CELL_2, is not in this list. It was in the first iteration's list, which is why you cannot find it.

A fix would be to first retrieve every part of the response and concatenate the blocks into a single list. Then you can run your script to look up the children within this concatenated list.
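
The example above can be reproduced offline as a minimal sketch in Python, with plain dictionaries standing in for the SDK's Block objects. The two pages and the helper name cell_text are made up for illustration; only the field names (Id, BlockType, Text, Relationships, Ids) follow the shape of a Textract response.

```python
# Hypothetical pages shaped like the example above: CELL_2 arrives on page 2,
# but its child WORD_2 was already delivered on page 1.
page_1 = [
    {"Id": "PAGE_1", "BlockType": "PAGE"},
    {"Id": "WORD_1", "BlockType": "WORD", "Text": "Name"},
    {"Id": "WORD_2", "BlockType": "WORD", "Text": "Total"},
    {"Id": "TABLE_1", "BlockType": "TABLE"},
    {"Id": "CELL_1", "BlockType": "CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["WORD_1"]}]},
]
page_2 = [
    {"Id": "CELL_2", "BlockType": "CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["WORD_2"]}]},
]


def cell_text(cell, blocks_by_id):
    """Join the Text of every CHILD block that can be found in blocks_by_id."""
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id.get(child_id)
            if child and child.get("Text"):
                words.append(child["Text"])
    return " ".join(words)


# Looking only inside the current page loses CELL_2's text.
page_2_only = {b["Id"]: b for b in page_2}
per_page_result = cell_text(page_2[0], page_2_only)

# Merging all pages first makes every child reachable.
all_blocks = {b["Id"]: b for b in page_1 + page_2}
merged_result = cell_text(page_2[0], all_blocks)

print(repr(per_page_result))  # ''
print(repr(merged_result))    # 'Total'
```

The relationship on CELL_2 is present in both cases; what changes is whether the block it points to can be found in the collection being searched.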

Hope this helps.

AWS
answered 2 years ago
1

To add to the above response, please review the implementation of the get_full_json function in Python linked below. It implements the recommended fix.
https://github.com/aws-samples/amazon-textract-textractor/blob/master/caller/textractcaller/t_call.py
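
For readers who do not want to open the link, the general pattern can be sketched as follows. This is not the code from that repository: collect_all_blocks and fetch_page are hypothetical names, and fetch_page stands in for a call such as boto3's get_document_analysis(JobId=..., NextToken=...). The fake pages at the bottom are assumed data that exist only so the sketch runs without AWS credentials.

```python
def collect_all_blocks(fetch_page):
    """Follow NextToken until it is absent and return every block, keyed by Id.

    fetch_page is a hypothetical callable: it takes the token (None for the
    first page) and returns a dict with "Blocks" and, optionally, "NextToken".
    """
    blocks_by_id = {}
    token = None
    while True:
        response = fetch_page(token)
        for block in response.get("Blocks", []):
            blocks_by_id[block["Id"]] = block
        token = response.get("NextToken")
        if not token:
            return blocks_by_id


# Fake two-page result so the sketch runs offline (assumed data, not real output).
FAKE_PAGES = {
    None: {"Blocks": [{"Id": "w1", "BlockType": "WORD", "Text": "hello"}],
           "NextToken": "t1"},
    "t1": {"Blocks": [{"Id": "c1", "BlockType": "CELL",
                       "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]}]},
}

blocks = collect_all_blocks(lambda token: FAKE_PAGES[token])
print(sorted(blocks))  # ['c1', 'w1']
```

With a real client, fetch_page would likely be a small wrapper that passes NextToken only when the token is not None, since boto3 rejects None as a parameter value.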

AWS
keithm
answered 2 years ago
0

Collecting all the blocks up front in a Dictionary worked! Thank you!

Here is the modified code...

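            // Collect every block from every page, keyed by Id, so children can be found regardless of which page returned them.
            Dictionary<string, Block> blocks = new Dictionary<string, Block>();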

            if (getResultsResponse.JobStatus == JobStatus.SUCCEEDED)
            {
                do
                {
                    getResultsResponse.Blocks.ForEach(x => {
                        blocks.Add(x.Id, x);
                    });

                    if (string.IsNullOrEmpty(getResultsResponse.NextToken)) { break; }

                    getResultsRequest.NextToken = getResultsResponse.NextToken;
                    getResultsResponse = _textractClient.GetDocumentAnalysis(getResultsRequest);
                } while (getResultsResponse.Blocks.Count > 0); 

                foreach (KeyValuePair<string, Block> entry in blocks)
                {
                    if (entry.Value.BlockType.Equals("CELL"))
                    {
                        Console.WriteLine("Page: " + entry.Value.Page.ToString());
                        Console.WriteLine("Rowindex: " + entry.Value.RowIndex.ToString());
                        Console.WriteLine("Colindex: " + entry.Value.ColumnIndex.ToString());
                        entry.Value.Relationships.ForEach(y =>
                        {
                            y.Ids.ForEach(z =>
                            {
                                if (blocks.ContainsKey(z.ToString()))
                                {
                                    var cellText = blocks[z.ToString()].Text;
                                    if (!string.IsNullOrEmpty(cellText))
                                    {
                                        Console.Write($"{cellText} ");
                                    }
                                    else
                                    {
                                        Console.Write($"EMPTY ");
                                    }
                                }
                            });
                        });
                    }
                }
            }

answered 2 years ago
