New datasets added to the AWS Public Blockchain Datasets: available for analytics and research

5 minute read
Content level: Foundational

The AWS Public Blockchain Datasets now include data from five additional networks (Aptos, Arbitrum, Base, Provenance, and XRPL) provided by SonarX, enabling free access to blockchain data through Amazon S3 and analysis with AWS Glue and Amazon Athena.

The AWS Public Blockchain Datasets, available under the Open Data program, have been expanded with the addition of five new chains: Aptos, Arbitrum, Base, Provenance, and XRPL (XRP Ledger). These datasets join our existing Bitcoin and Ethereum offerings, expanding the possibilities for blockchain research and analytics on AWS.

Background

In 2022, we launched the AWS Public Blockchain Datasets to provide researchers and developers with free access to blockchain data. The datasets are stored as Parquet files in Amazon S3, partitioned by date for optimal query performance. In collaboration with SonarX, an AWS Partner specializing in blockchain data indexing, we keep these datasets regularly updated and maintained to high quality standards.

Dataset access and organization

All blockchain datasets are publicly available in the s3://aws-public-blockchain bucket. The data is organized hierarchically by chain and date, making it simple to query specific time ranges or perform cross-chain analysis. Each blockchain's data follows a consistent schema, with common fields like transaction hash, sender, receiver, and timestamp.

Here is the list of datasets and their respective public Amazon S3 URLs:

  • Bitcoin: s3://aws-public-blockchain/v1.0/btc/
  • Ethereum: s3://aws-public-blockchain/v1.0/eth/
  • (new) Aptos: s3://aws-public-blockchain/v1.1/sonarx/aptos/
  • (new) Arbitrum: s3://aws-public-blockchain/v1.1/sonarx/arbitrum/
  • (new) Base: s3://aws-public-blockchain/v1.1/sonarx/base/
  • (new) Provenance: s3://aws-public-blockchain/v1.1/sonarx/provenance/
  • (new) XRPL: s3://aws-public-blockchain/v1.1/sonarx/xrp/
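
As a quick check before setting up any tooling, the prefixes above can be browsed with the AWS CLI. The following is a minimal sketch: dataset_url is a hypothetical helper (not part of any AWS tool) that maps a chain name to its prefix from the list above, and the commented example assumes the bucket allows anonymous listing through the --no-sign-request flag, which is common for Open Data buckets but is not stated in this article.

```shell
# Hypothetical helper: map a chain name to its S3 prefix from the list above.
dataset_url() {
    case "$1" in
        btc|eth)
            echo "s3://aws-public-blockchain/v1.0/$1/" ;;
        aptos|arbitrum|base|provenance|xrp)
            echo "s3://aws-public-blockchain/v1.1/sonarx/$1/" ;;
        *)
            echo "unknown chain: $1" >&2
            return 1 ;;
    esac
}

# Example (needs network access; anonymous listing is an assumption):
#   aws s3 ls --no-sign-request "$(dataset_url base)"
```

Note that the XRPL prefix uses the short name xrp, so the helper expects xrp rather than xrpl.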

Use Cases

Our customers use these datasets in various ways:

  • Research and Analytics: Academic institutions and government organizations analyze network behavior patterns and conduct cross-chain comparative studies
  • Business Intelligence: Companies monitor network activity and user behavior to inform strategic decisions
  • Risk and Compliance: Financial institutions use the data for transaction monitoring, fraud detection, and auditing purposes
  • DeFi and Trading: Trading firms backtest strategies and analyze market dynamics

Getting Started

Let's walk through setting up and querying these datasets using AWS services.

Create a Glue Crawler

First, we'll create a Glue Crawler to catalog the data, using the following AWS CLI commands. For more information, refer to the AWS Glue documentation.

Create the necessary IAM role:

aws iam create-role \
    --role-name AWSGlueServiceRole-Blockchain \
    --assume-role-policy-document '{
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "glue.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }'

Add an inline policy that grants the role read access to the public S3 bucket:

aws iam put-role-policy \
    --role-name AWSGlueServiceRole-Blockchain \
    --policy-name S3Access \
    --policy-document '{
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::aws-public-blockchain",
                    "arn:aws:s3:::aws-public-blockchain/*"
                ]
            }
        ]
    }'

Attach the AWS managed policy for Glue Service:

aws iam attach-role-policy \
    --role-name AWSGlueServiceRole-Blockchain \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

Create the Glue Crawler, scheduled to run daily at 05:00 UTC:

aws glue create-crawler \
 --name blockchain-crawler-base \
 --role AWSGlueServiceRole-Blockchain \
 --database-name sonarx_base \
 --targets '{"S3Targets": [{"Path": "s3://aws-public-blockchain/v1.1/sonarx/base/", "SampleSize": 2}]}' \
 --schedule "cron(0 5 * * ? *)" 

Run the Glue Crawler:

aws glue start-crawler --name blockchain-crawler-base
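
The crawler commands above are specific to Base. To catalog another chain, the same two commands can be wrapped in a small function. This is a sketch only: create_chain_crawler is a hypothetical name, and it assumes the IAM role created earlier and the sonarx_<chain> database naming used in this walkthrough.

```shell
# Hypothetical wrapper around the commands above: create and start a crawler
# for one SonarX chain, reusing the IAM role created earlier.
create_chain_crawler() {
    chain="$1"   # e.g. arbitrum; must match a prefix under v1.1/sonarx/
    targets=$(printf '{"S3Targets": [{"Path": "s3://aws-public-blockchain/v1.1/sonarx/%s/", "SampleSize": 2}]}' "$chain")

    aws glue create-crawler \
        --name "blockchain-crawler-$chain" \
        --role AWSGlueServiceRole-Blockchain \
        --database-name "sonarx_$chain" \
        --targets "$targets" \
        --schedule "cron(0 5 * * ? *)" &&
    aws glue start-crawler --name "blockchain-crawler-$chain"
}

# Example: create_chain_crawler arbitrum
```

Called with arbitrum, this would produce a sonarx_arbitrum database alongside sonarx_base.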

The first time you start the crawler, it might take between one and five minutes to fetch files from the datasets and assemble a fitting data schema. You can monitor the first run of the AWS Glue Crawler from the Console (assuming us-east-1 as the Region), or by running the following command:

aws glue get-crawler --name blockchain-crawler-base --query 'Crawler.State'
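
Rather than re-running the status command by hand, you could poll it in a loop. The sketch below is one possible approach; wait_for_crawler and its default 15-second interval are illustrative choices, not part of the AWS CLI. It should be called after the crawler has been started, because an idle crawler also reports READY.

```shell
# Hypothetical helper: poll until the crawler returns to the READY state.
wait_for_crawler() {
    name="$1"
    interval="${2:-15}"   # seconds between polls (arbitrary default)
    while :; do
        state=$(aws glue get-crawler --name "$name" \
                    --query 'Crawler.State' --output text) || return 1
        echo "crawler $name: $state"
        [ "$state" = "READY" ] && return 0
        sleep "$interval"
    done
}

# Example: wait_for_crawler blockchain-crawler-base
```

READY only means the crawler is idle again. The outcome of the last run is reported separately, under Crawler.LastCrawl.Status in the same get-crawler response.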

Query the Data with Amazon Athena

Once the crawler has run, you can query the data using Amazon Athena. Here's a query to get daily transaction counts for Base:

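-- Daily transaction counts for Base over the last 30 days
SELECT
    date_trunc('day', datetime) as date,
    COUNT(*) as tx_count
FROM "sonarx_base"."transactions"
-- Filtering on the date partition (compared as a 'YYYY-MM-DD' string) limits the data scanned
WHERE date >= date_format(current_date - interval '30' day, '%Y-%m-%d')
GROUP BY date_trunc('day', datetime)
ORDER BY date DESC;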
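
A query like the one above can also be run from the command line instead of the Athena console. The following is a rough sketch under a few assumptions: run_athena_query is a hypothetical helper, the results location must be an S3 bucket you own (the name in the example is a placeholder), and the SQL is expected to use fully qualified table names, as the queries in this article do.

```shell
# Hypothetical helper: run one SQL statement in Athena and print the results.
# $1 = SQL text, $2 = S3 location for query results (a bucket you own).
run_athena_query() {
    sql="$1"
    output_location="$2"

    query_id=$(aws athena start-query-execution \
        --query-string "$sql" \
        --result-configuration "OutputLocation=$output_location" \
        --query 'QueryExecutionId' --output text) || return 1

    # Poll until the query leaves the QUEUED/RUNNING states.
    while :; do
        state=$(aws athena get-query-execution \
            --query-execution-id "$query_id" \
            --query 'QueryExecution.Status.State' --output text) || return 1
        case "$state" in
            SUCCEEDED) break ;;
            FAILED|CANCELLED)
                echo "query $query_id: $state" >&2
                return 1 ;;
        esac
        sleep 2
    done

    aws athena get-query-results --query-execution-id "$query_id" --output table
}

# Example (daily_tx.sql and the bucket name are placeholders):
#   run_athena_query "$(cat daily_tx.sql)" s3://my-athena-results-bucket/
```

For large result sets, downloading the CSV file that Athena writes to the results location is likely more practical than printing rows with get-query-results.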

Let's try something more interesting: comparing daily active users across different chains. As a prerequisite, you first need to create and run a Glue Crawler for a second blockchain (Arbitrum in this example, cataloged in a database named sonarx_arbitrum).

WITH daily_users AS (
    SELECT 
        'base' as chain,
        date_trunc('day', datetime) as date,
        COUNT(DISTINCT from_address) as unique_users
    FROM "sonarx_base"."transactions"
    WHERE date >= date_format(current_date - interval '14' day, '%Y-%m-%d')
    GROUP BY date_trunc('day', datetime)
    
    UNION ALL
    
    SELECT 
        'arbitrum' as chain,
        date_trunc('day', datetime) as date,
        COUNT(DISTINCT from_address) as unique_users
    FROM "sonarx_arbitrum"."transactions"
    WHERE date >= date_format(current_date - interval '14' day, '%Y-%m-%d')
    GROUP BY date_trunc('day', datetime)
)
-- LAG() fetches the previous day's count per chain; it is NULL on each chain's first day
SELECT
    chain,
    date,
    unique_users,
    unique_users - LAG(unique_users) OVER (PARTITION BY chain ORDER BY date) as daily_change
FROM daily_users
ORDER BY chain, date;

Conclusion

The AWS Public Blockchain Datasets power data-driven workloads and are accessible free of charge. Because the data can be queried in place in Amazon S3, without copying it first, data pipelines are faster to build and valuable information is quicker to reach.

Check out SonarX’s website for an in-depth selection of data covering more than 70 blockchains. Let us know what you build with these datasets! Share your experience in the comments below.

AWS (EXPERT), published 2 months ago