Gathering QuickSight Dataset details for showback cost reporting
This article explains how to collect more detailed SPICE dataset information for cost reporting.
AWS Cost and Usage Reports only shows the total cost of SPICE usage in an account. If different teams in your account consume significant SPICE capacity across many datasets, you need a way to show the cost of individual datasets for showback or chargeback reporting to those teams.
Solution Summary
The solution uses an AWS Lambda function to extract the metadata of all Amazon QuickSight datasets into a Parquet file each day and store it in Amazon S3. The Lambda function runs on a daily schedule and triggers an AWS Glue Crawler to crawl the bucket, creating a table of the contents that you can query in Athena. That table can then be brought into Amazon QuickSight as a dataset for reporting. Along the way, the function converts the size of each dataset from bytes to GB and adds every tag on a dataset as its own field, so the tags are usable in Amazon QuickSight to report the cost of the datasets by tag.
We collect the Name, Arn, DataSetId, LastUpdatedTime, and ConsumedSpiceCapacityInBytes for each dataset. Some additional handling of the metadata is needed to improve reporting. SPICE is priced per GB, but dataset size is reported in bytes, so the first addition is a ConsumedSpiceCapacityInGB field that converts ConsumedSpiceCapacityInBytes to GB.
We also add a DateScanned field capturing the date the dataset metadata was collected, because we run the collection every day and keep the results for historical reporting.
When you have a significant number of datasets, you need to be able to group them by the teams responsible for them. For that we make an additional call to retrieve the tags for each dataset and add them to the dataframe. Instead of adding the tags as a single JSON field, we add them as individual fields, which makes reporting easier. We accomplish this by creating a tag dictionary and updating the dataframe with it.
The function runs once a day, writes a single file per day to the S3 bucket, and triggers AWS Glue to crawl the bucket and update the Athena table. You can query the data directly in Athena, use it as a dataset in QuickSight for business reporting alongside the Cloud Intelligence Dashboards, or ingest it into the reporting tool of your choice. One way to set up the daily schedule is shown in the sketch below.
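If you don't already have a daily trigger in place, one option is an Amazon EventBridge scheduled rule that invokes the Lambda function once a day. The sketch below is a minimal example of that wiring; the rule name, function name, and function ARN are placeholders for your own values.

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Placeholder names and ARN - replace with your own values
rule_name = 'quicksight-dataset-metadata-daily'
function_name = 'quicksight-dataset-metadata'
function_arn = 'arn:aws:lambda:us-east-1:111122223333:function:quicksight-dataset-metadata'

# Create a rule that fires once a day
rule_arn = events.put_rule(Name=rule_name, ScheduleExpression='rate(1 day)')['RuleArn']

# Point the rule at the Lambda function
events.put_targets(Rule=rule_name, Targets=[{'Id': '1', 'Arn': function_arn}])

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName=function_name,
    StatementId='allow-eventbridge-daily-rule',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn
)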
Overview of Solution
You'll need to deploy an Amazon S3 bucket and an AWS Glue Crawler alongside this AWS Lambda function; here I am focusing on the Lambda function itself.
Permissions
Lambda permissions:
quicksight:ListDataSets
quicksight:DescribeDataSet
quicksight:DescribeDataSetPermissions
quicksight:ListTagsForResource
glue:StartCrawler
Glue Crawler permissions:
s3:GetObject
s3:ListBucket
Converting Bytes to GB
We start by defining the conversion factor that we divide the byte value by to get GB.
# Assuming consumed_spice_capacity is already defined in bytes
bytes_to_gb = 1024 ** 3 # 1 GB = 1024^3 bytes
Later, when building the dataframe, the conversion to GB is done inline.
'ConsumedSpiceCapacityInGB': consumed_spice_capacity / bytes_to_gb if consumed_spice_capacity else None
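As a quick sanity check, with a made-up dataset size of 50 GB the conversion works out like this:

consumed_spice_capacity = 53687091200          # hypothetical value: 50 * 1024**3 bytes
bytes_to_gb = 1024 ** 3
print(consumed_spice_capacity / bytes_to_gb)   # prints 50.0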
Getting Dataset metadata
To get the metadata we need for each dataset, we have to make two separate calls: describing a dataset does not return its tags, so a separate call retrieves the tags alongside the basic metadata.
for dataset in datasets:
    dataset_id = dataset['DataSetId']
    dataset_name = dataset['Name']
    dataset_arn = dataset['Arn']
    last_updated_time = dataset['LastUpdatedTime'].astimezone(timezone.utc)
    try:
        response = quicksight.describe_data_set(AwsAccountId=account_id, DataSetId=dataset_id)
        consumed_spice_capacity = response['DataSet']['ConsumedSpiceCapacityInBytes']
        # Get tags for the dataset
        tags = quicksight.list_tags_for_resource(ResourceArn=dataset_arn)['Tags']
        # Create tag dictionary
        tag_dict = {f"Tag_{tag['Key']}": tag['Value'] for tag in tags}
The last line in the sample above loops through the tags returned for the dataset and builds a dictionary with one Tag_<Key> entry per tag. That dictionary is later merged into the record for the dataset, so each tag becomes its own field in the dataframe. This keeps the code simple because we don't have to handle specific tags, although it may bring over more operational tags than you care about for reporting purposes.
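For example, a dataset tagged with the hypothetical keys Team and Environment would end up with two extra fields in its record:

tags = [{'Key': 'Team', 'Value': 'Analytics'}, {'Key': 'Environment', 'Value': 'Prod'}]   # illustrative values
tag_dict = {f"Tag_{tag['Key']}": tag['Value'] for tag in tags}
# tag_dict == {'Tag_Team': 'Analytics', 'Tag_Environment': 'Prod'}

dataset_detail = {'DataSetId': 'example-id', 'Name': 'Example dataset'}   # trimmed for brevity
dataset_detail.update(tag_dict)
# dataset_detail now also contains Tag_Team and Tag_Environment as fields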
Writing out to parquet
I initially tried pyarrow to write the dataframes out to a Parquet file in S3, but ran into a lot of datatype issues. I had more success using pandas, which got the data into Parquet with less manipulation.
I had to do some manipulation of the date fields to get them into Parquet in a form the Glue Crawler could read when creating the tables. Right now they are stored as string fields; getting them into proper timestamps is still on my to-do list.
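The normalization the function applies is to floor both date fields to day precision and format them as strings. A minimal sketch of that step, using an illustrative value:

import pandas as pd

df = pd.DataFrame({'DateScanned': ['2023-06-15T14:30:00Z']})   # illustrative value
df['DateScanned'] = pd.to_datetime(df['DateScanned'], utc=True).dt.floor('D')
df['DateScanned'] = df['DateScanned'].dt.strftime('%Y-%m-%d 00:00:00.000')
print(df['DateScanned'][0])   # 2023-06-15 00:00:00.000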
Lambda Function
import boto3
from datetime import datetime, timezone
import pandas as pd
from botocore.exceptions import ClientError
import os
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # Set up AWS clients
    quicksight = boto3.client('quicksight')
    s3 = boto3.client('s3')
    glue = boto3.client('glue')

    # Get the AWS account ID
    account_id = context.invoked_function_arn.split(":")[4]

    # Get the current UTC time
    current_utc = datetime.now(timezone.utc)

    # Format the date as a string (e.g., '2023-06-15T14:30:00Z')
    date_string = current_utc.strftime('%Y-%m-%dT%H:%M:%SZ')

    # Get bucket name from environment variable
    bucket_name = os.environ['S3_BUCKET_NAME']

    # Get Glue crawler name from environment variable
    crawler_name = os.environ['GLUE_CRAWLER_NAME']

    # Conversion factor from bytes to GB
    bytes_to_gb = 1024 ** 3  # 1 GB = 1024^3 bytes

    # Get a list of all datasets in the QuickSight account
    datasets = []
    paginator = quicksight.get_paginator('list_data_sets')
    for page in paginator.paginate(AwsAccountId=account_id):
        datasets.extend(page['DataSetSummaries'])

    dataset_details = []
    for dataset in datasets:
        dataset_id = dataset['DataSetId']
        dataset_name = dataset['Name']
        dataset_arn = dataset['Arn']
        last_updated_time = dataset['LastUpdatedTime']
        try:
            response = quicksight.describe_data_set(AwsAccountId=account_id, DataSetId=dataset_id)
            consumed_spice_capacity = response['DataSet']['ConsumedSpiceCapacityInBytes']
            # Get tags for the dataset
            tags = quicksight.list_tags_for_resource(ResourceArn=dataset_arn)['Tags']
            # Create tag dictionary
            tag_dict = {f"Tag_{tag['Key']}": tag['Value'] for tag in tags}
        except ClientError as e:
            logger.error(f"Error describing dataset {dataset_id}: {e}")
            consumed_spice_capacity = None
            tag_dict = {}
        dataset_detail = {
            'DataSetId': dataset_id,
            'Name': dataset_name,
            'Arn': dataset_arn,
            'ConsumedSpiceCapacityInBytes': consumed_spice_capacity,
            'ConsumedSpiceCapacityInGB': consumed_spice_capacity / bytes_to_gb if consumed_spice_capacity else None,
            'LastUpdatedTime': last_updated_time,
            'DateScanned': date_string
        }
        # Add tag data to dataset_detail
        dataset_detail.update(tag_dict)
        dataset_details.append(dataset_detail)

    df = pd.DataFrame(dataset_details)

    # Normalize the date fields to day precision and store them as strings
    df['LastUpdatedTime'] = pd.to_datetime(df['LastUpdatedTime'], utc=True)
    df['LastUpdatedTime'] = df['LastUpdatedTime'].dt.floor('D')
    df['LastUpdatedTime'] = df['LastUpdatedTime'].dt.strftime('%Y-%m-%d 00:00:00.000')
    df['DateScanned'] = pd.to_datetime(df['DateScanned'], utc=True)
    df['DateScanned'] = df['DateScanned'].dt.floor('D')
    df['DateScanned'] = df['DateScanned'].dt.strftime('%Y-%m-%d 00:00:00.000')

    object_key = f'quicksight_datasets_{date_string}.parquet'

    # Save DataFrame to a temporary file
    temp_file = f"/tmp/{object_key}"
    df.to_parquet(temp_file, engine='fastparquet')

    # Upload the file to S3
    s3.upload_file(temp_file, bucket_name, f"dshistory/{object_key}")

    # Clean up the temporary file
    os.remove(temp_file)

    # Trigger the Glue crawler so the Athena table picks up the new file
    try:
        glue.start_crawler(Name=crawler_name)
    except ClientError as e:
        logger.error(f"Error starting Glue crawler {crawler_name}: {e}")

    return {
        'statusCode': 200,
        'body': f'Successfully processed {len(dataset_details)} datasets and saved to S3'
    }
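Once the data lands in S3, a simple way to sanity-check your showback numbers is to read a day's Parquet file with pandas and roll the GB up by team tag. The sketch below is only illustrative: the bucket path, file name, tag column (Tag_Team), and the SPICE price per GB-month are placeholders, so verify the rate against current QuickSight SPICE pricing for your edition and Region.

import pandas as pd

# Placeholder bucket/key and tag column - replace with your own values
# Reading s3:// paths with pandas requires the s3fs package
df = pd.read_parquet('s3://your-bucket/dshistory/quicksight_datasets_2023-06-15T14:30:00Z.parquet')

spice_price_per_gb_month = 0.38   # placeholder rate; verify against current pricing

# Total SPICE GB per team, multiplied by the monthly rate for a rough showback figure
summary = df.groupby('Tag_Team', dropna=False)['ConsumedSpiceCapacityInGB'].sum()
print((summary * spice_price_per_gb_month).round(2))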
Now you have the tools you need to implement this in your own environment.