AWS Public Blockchain Data is incorrect on multiple days


There is a publicly accessible S3 bucket (aws-public-blockchain) with Ethereum blockchain data, but it has an issue: the row count is incorrect on multiple days. As you can see, the following query against the BigQuery public blockchain data produces a count of 2392744 rows. (BigQuery screenshot)

At Athena, the same query on the aws-public-blockchain data produces a count of 3895537 rows. (Athena screenshot)
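
The screenshots don't show the query text itself, but it was a simple per-day row count along these lines (the BigQuery side uses the public bigquery-public-data.crypto_ethereum dataset; the Athena table name eth.logs stands in for whatever table you registered over s3://aws-public-blockchain/v1.0/eth/logs/, so adjust it to your setup):

```sql
-- BigQuery: public crypto_ethereum dataset
SELECT COUNT(*) AS row_count
FROM `bigquery-public-data.crypto_ethereum.logs`
WHERE DATE(block_timestamp) = '2023-03-14';

-- Athena: assumed table over s3://aws-public-blockchain/v1.0/eth/logs/,
-- partitioned by the "date" column
SELECT COUNT(*) AS row_count
FROM eth.logs
WHERE "date" = '2023-03-14';
```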

You can see that the folder "s3://aws-public-blockchain/v1.0/eth/logs/date=2023-03-14/" contains files with two different upload times. There are also other folders containing junk data. Querying data from that day can produce misleading results. As I understand it, this error occurred when the data ingestion process temporarily stopped and was restarted afterwards. (S3 folder screenshot)
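
One way to confirm this from inside Athena is the "$path" pseudo-column, which reports the S3 object each row was read from; rows from the interrupted run and from the re-run show up under different files (again assuming a table named eth.logs):

```sql
-- Group the affected partition by the underlying S3 object. Duplicate
-- uploads appear as extra files, each carrying its own copy of the rows.
SELECT "$path" AS source_file,
       COUNT(*) AS row_count
FROM eth.logs
WHERE "date" = '2023-03-14'
GROUP BY "$path"
ORDER BY row_count DESC;
```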

These are my three main requests:

  1. Eliminate the junk data so that the dataset can be queried reliably.
  2. "value" column in token_transfers table has double as type, but because original data coming from blockchain doesn't comes in int256 here we lose precision. Values can be really big up to 115792089237316195423570985008687907853269984665640564039457584007913129639935 So I suggest switching to string type.
  3. Is there a way to increase the update rate of the data? As I understand it, the data in that bucket gets updated once a day. It would be better to have the data refreshed at a more frequent rate.
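
To illustrate the precision problem in request 2: a double keeps only about 15-17 significant decimal digits, while uint256 values can have up to 78 digits. A quick check in Athena (plain Presto SQL, no table required) shows the rounding:

```sql
-- The max uint256 value has 78 decimal digits; casting it to double keeps
-- only ~17 significant digits, so the rest is silently rounded away.
SELECT CAST(
  '115792089237316195423570985008687907853269984665640564039457584007913129639935'
  AS double) AS value_as_double;
-- returns approximately 1.1579208923731621E77, not the exact integer
```
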
2 Answers

Hello, thanks for bringing this to us. The data has been updated and your requested changes should all be reflected. Please let us know if you find anything else or feel that something has not been fully addressed.

Regarding the third question: what kind of data update rate would be most helpful for you? Then this can be taken into consideration for potential future updates.

AWS
answered 10 months ago
  • Hello and thanks for the reply. For the logs table, however, there are still two days' folders with incorrect files in them: 2022-11-05 and 2022-11-01. Also, data for 2023-06-12 and 2023-06-11 is missing.

    I wasn't clear enough about the second point. What I wanted to say is that the "token_transfers" table's "value" column currently has the 'double' type, while the same table at BigQuery uses the 'string' type. If you check a blockchain explorer, for example on this page, you can see that transferred token values can be bigger than a double can hold without losing precision. The value can be as large as 115792089237316195423570985008687907853269984665640564039457584007913129639935, which is the maximum of uint256. Storing that value as a string would satisfy everyone using the data: those who want a double can safely cast it in their queries (a small sketch follows at the end of this comment). So, could you please re-upload just the "token_transfers" table's data with "value" as a string?

    Coming to the upload rate, it would be best to have the data streamed from the blockchain; BigQuery supports that. If that is not possible, then it would also be good to have the data refreshed every hour. Thank you in advance!
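
    If the column were stored as a string, consumers could still do numeric work by casting it themselves. A minimal sketch, assuming an Athena table eth.token_transfers with the usual Ethereum ETL column names:

    ```sql
    -- Lossy but convenient: cast the string value back to double on read.
    SELECT token_address,
           from_address,
           to_address,
           TRY_CAST(value AS double) AS value_double
    FROM eth.token_transfers
    WHERE "date" = '2023-03-14'
    LIMIT 10;
    ```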


Hi,

To address your query, I suggest reaching out to the team responsible for managing the dataset. You can log a ticket on the GitHub page below: https://github.com/aws-solutions-library-samples/guidance-for-digital-assets-on-aws/tree/main/analytics

When it comes to handling "junk" data, excluding patterns in Athena can be tedious, particularly if your dataset is frequently updated. In such cases, using Redshift Spectrum [1] may give better results. Additionally, making changes to the dataset requires control over it: you can copy the data from the AWS public blockchain bucket into your own and apply transformation functions using tools like Spark, Hive, or other big data ETL tools.

To automate the ingestion of new data, you can use S3 replication rules [2] to copy new objects to your bucket automatically. Moreover, leveraging S3 Intelligent-Tiering can help reduce storage costs by moving old data that is rarely queried to lower-cost tiers.
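
As a rough sketch of the copy-and-transform idea: if the goal is only to work around the handful of bad or duplicated days, an Athena CTAS statement can write a cleaned copy into your own bucket. Every name here is an assumption to adapt — the eth.logs source table, the target bucket, and the list of dates to skip:

```sql
-- Write a cleaned copy of one month to your own bucket. SELECT DISTINCT drops
-- rows duplicated by the interrupted/restarted ingestion; NOT IN skips days
-- you prefer to exclude entirely. Athena CTAS writes at most 100 partitions
-- per query, so copy the table in date-range chunks and extend with INSERT INTO.
CREATE TABLE eth.logs_clean
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/eth/logs_clean/',
  partitioned_by = ARRAY['date']
) AS
SELECT DISTINCT *
FROM eth.logs
WHERE "date" BETWEEN '2022-11-01' AND '2022-11-30'
  AND "date" NOT IN ('2022-11-01', '2022-11-05');
```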

I hope this information proves helpful. Should you require further assistance, please let me know.

[1] Redshift Spectrum data files - Amazon Redshift Documentation
[2] Configuring replication - Amazon S3 User Guide

odwa_y
answered 10 months ago
