Textract - Why do Blocks from subsequent GetDocumentAnalysisAsync(getResultsRequest) calls have no relationships populated?

Hi and thank you for any help you can provide...

I am using Textract with the .NET SDK.

I am able to successfully submit a TABLE analysis job and get its results, and now I am trying to loop through them. I make my first call to GetDocumentAnalysisAsync, then loop through the blocks of type CELL to get the row and column index, and then look through each cell's relationships to get the text.

This all works fine until I need to get the next set of blocks by calling GetDocumentAnalysisAsync again with the NextToken. I then get the next set of CELL blocks and begin looping again. This time, none of my CELL blocks have any relationships populated to get the text from; I just get what looks like a lot of empty cells. I have verified the page is readable through the console demo page, so there should be text there.

Here is the code (in C#) I am using to iterate through the blocks and retrieve the next set of blocks.


            if (getResultsResponse.JobStatus == JobStatus.SUCCEEDED)
            {
                do
                {
                    getResultsResponse.Blocks.ForEach(x => {
                        if (x.BlockType.Equals("CELL"))
                        {
                            Console.WriteLine("Page: " + x.Page.ToString());
                            Console.WriteLine("Rowindex: " + x.RowIndex.ToString());
                            Console.WriteLine("Colindex: " + x.ColumnIndex.ToString());
                            x.Relationships.ForEach(y =>
                            {
                                y.Ids.ForEach(z =>
                                {
                                    var cellText = (from text in getResultsResponse.Blocks where text.Id == z.ToString() select text.Text).FirstOrDefault();
                                    if (!string.IsNullOrEmpty(cellText))
                                    {
                                        Console.Write($"{cellText} ");
                                    }
                                });
                            });
                        }
                    });

                    if (string.IsNullOrEmpty(getResultsResponse.NextToken)) { break; }

                    getResultsRequest.NextToken = getResultsResponse.NextToken;
                    getResultsResponse = await _textractClient.GetDocumentAnalysisAsync(getResultsRequest);
                } while (getResultsResponse.Blocks.Count > 0);
            }

Asked 2 years ago, 292 views
3 answers

Hi,

Judging by your code, you replace getResultsResponse on every iteration with the subset returned for the NextToken. The blocks in a full response are usually ordered WORD, then LINE, then CELL, so on the first iteration getResultsResponse contains WORD, LINE and CELL blocks, which is why you can find the child elements. However, the next set of blocks may contain only CELL blocks, so when you look up their children, they are not in that subset.

For example, if the full response is [ PAGE_1, WORD_1, WORD_2, TABLE_1, CELL_1, CELL_2 ] and each response holds at most 5 blocks before moving on to the next token, getResultsResponse will look like this:

  • 1st iteration: getResultsResponse = [ PAGE_1, WORD_1, WORD_2, TABLE_1, CELL_1 ], so you will find WORD_1, the child of CELL_1.
  • 2nd iteration: getResultsResponse = [ CELL_2 ], so WORD_2, the child of CELL_2, is not in this list. It was returned in the first iteration, which is why you cannot find it.

A fix would be to first retrieve every part of the response and concatenate the blocks into a single list. You can then run your script to look up the children within this concatenated list.
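
The example above can be reproduced with plain Python lists. This is only a toy sketch: the block names and the page size of 5 come from the example, while the dictionary layout (a "Children" list and made-up "Text" values) is a simplified, hypothetical stand-in for real Textract blocks.

```python
# Toy blocks mirroring the example; "Children" and the texts are illustrative only.
full_response = [
    {"Id": "PAGE_1", "BlockType": "PAGE", "Children": []},
    {"Id": "WORD_1", "BlockType": "WORD", "Text": "foo", "Children": []},
    {"Id": "WORD_2", "BlockType": "WORD", "Text": "bar", "Children": []},
    {"Id": "TABLE_1", "BlockType": "TABLE", "Children": ["CELL_1", "CELL_2"]},
    {"Id": "CELL_1", "BlockType": "CELL", "Children": ["WORD_1"]},
    {"Id": "CELL_2", "BlockType": "CELL", "Children": ["WORD_2"]},
]

PAGE_SIZE = 5  # blocks per response, as in the example
pages = [full_response[i:i + PAGE_SIZE]
         for i in range(0, len(full_response), PAGE_SIZE)]


def cell_text(cell, searchable):
    """Join the text of the cell's children that exist in `searchable`."""
    by_id = {block["Id"]: block for block in searchable}
    return " ".join(by_id[child]["Text"]
                    for child in cell["Children"]
                    if child in by_id and by_id[child].get("Text"))


# Looking up children only within the page that contains the cell.
per_page = {block["Id"]: cell_text(block, page)
            for page in pages for block in page
            if block["BlockType"] == "CELL"}

# Looking up children in the concatenation of all pages.
all_blocks = [block for page in pages for block in page]
merged = {block["Id"]: cell_text(block, all_blocks)
          for block in all_blocks if block["BlockType"] == "CELL"}

print(per_page)  # {'CELL_1': 'foo', 'CELL_2': ''}
print(merged)    # {'CELL_1': 'foo', 'CELL_2': 'bar'}
```

CELL_2 comes back empty in the per-page lookup only because WORD_2 arrived in an earlier page, not because the cell has no children.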

Hope this helps.

AWS
Answered 2 years ago

To add to the above response, please review the implementation of the get_full_json function (in Python) linked below. It implements the recommended fix.
https://github.com/aws-samples/amazon-textract-textractor/blob/master/caller/textractcaller/t_call.py
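
For readers who do not want to open the link, the idea can be sketched roughly as follows. This is not the code of the linked function, only a minimal illustration of the same approach. It assumes a boto3 Textract client, whose get_document_analysis call takes JobId and an optional NextToken and returns a dict with "Blocks" and, while more results remain, "NextToken". The helper names collect_all_blocks and cell_texts are hypothetical.

```python
def collect_all_blocks(client, job_id):
    """Follow NextToken until it runs out; return every block keyed by its Id."""
    blocks = {}
    kwargs = {"JobId": job_id}
    while True:
        response = client.get_document_analysis(**kwargs)
        for block in response.get("Blocks") or []:
            blocks[block["Id"]] = block
        token = response.get("NextToken")
        if not token:
            return blocks
        kwargs["NextToken"] = token


def cell_texts(blocks):
    """Yield (page, row, column, text) per CELL, resolving CHILD ids across all pages."""
    for block in blocks.values():
        if block["BlockType"] != "CELL":
            continue
        words = []
        for relationship in block.get("Relationships") or []:
            if relationship["Type"] != "CHILD":
                continue
            for child_id in relationship["Ids"]:
                child = blocks.get(child_id)
                # Children without text (for example selection elements) are skipped.
                if child and child.get("Text"):
                    words.append(child["Text"])
        yield (block.get("Page"), block["RowIndex"],
               block["ColumnIndex"], " ".join(words))
```

With boto3 installed, the client would typically come from boto3.client("textract"), and the cells could be printed with a loop over cell_texts(collect_all_blocks(client, job_id)), where job_id is the Id returned when the analysis job was started.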

AWS
keithm
Answered 2 years ago

Collecting all the blocks up front in a Dictionary worked! Thank you!

Here is the modified code...

            Dictionary<string, Block> blocks = new Dictionary<string, Block>();

            if (getResultsResponse.JobStatus == JobStatus.SUCCEEDED)
            {
                // Pass 1: collect the blocks from every page of results, keyed by block Id.
                do
                {
                    getResultsResponse.Blocks.ForEach(x => {
                        blocks.Add(x.Id, x);
                    });

                    if (string.IsNullOrEmpty(getResultsResponse.NextToken)) { break; }

                    getResultsRequest.NextToken = getResultsResponse.NextToken;
                    getResultsResponse = _textractClient.GetDocumentAnalysis(getResultsRequest);
                } while (getResultsResponse.Blocks.Count > 0);

                // Pass 2: resolve each CELL's related block Ids against the complete dictionary.
                foreach (KeyValuePair<string, Block> entry in blocks)
                {
                    if (entry.Value.BlockType.Equals("CELL"))
                    {
                        Console.WriteLine("Page: " + entry.Value.Page.ToString());
                        Console.WriteLine("Rowindex: " + entry.Value.RowIndex.ToString());
                        Console.WriteLine("Colindex: " + entry.Value.ColumnIndex.ToString());
                        entry.Value.Relationships.ForEach(y =>
                        {
                            y.Ids.ForEach(z =>
                            {
                                // Related blocks may have arrived in any page, so look them up in the dictionary.
                                if (blocks.TryGetValue(z, out Block related))
                                {
                                    if (!string.IsNullOrEmpty(related.Text))
                                    {
                                        Console.Write($"{related.Text} ");
                                    }
                                    else
                                    {
                                        Console.Write("EMPTY ");
                                    }
                                }
                            });
                        });
                    }
                }
            }


Answered 2 years ago
