- Newest
- Most votes
- Most comments
Hello,
The above error message suggests that Hive was expecting ':' colon but found back-slash. This back-slash '' is working as an escape character for the colon ':'. This backslash is added by the crawler while creating a schema with special characters. Currently, Athena does not support any special character in the schema apart from underscore '_'
It is indeed that the structure of CloudTrail logs is quite standardized. However, at a closer look, structure of nested children fields might vary. Below are nested field structure in the "RequestParameter" field. As you can see, logs from different service API calls have different parameters.
S3 Request: {"bucketName":"xxxx","Host":"s3.eu-west-1.amazonaws.com","key":"xxxx"}
IAM Request: {"resourceArn":"arn:aws:sagemaker:eu-west-1:xxxx:xxxx"}
Thus, for CloudTrail logs, we'd recommend to use the CREATE TABLE DDL mentioned in the below document link[1] [-] https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html#create-cloudtrail-table
It is because this DDL defines a basic structure for the CloudTrail JSON records, but leaves the structure of nested fields flexible (eg. treating the whole RequestParameter field as JSON string). On the other hand, when crawling the data, Glue crawlers will try its best to recognize all the fields it see.
Therefore, I would suggest the below workaround:
- Create the table with Athena DDL. (from the Doc above)
- Run a Glue crawler to add the partitions to the table.
Step 1. Create the table with Athena DDL. (from the Doc above):
CREATE EXTERNAL TABLE cloudtrail_logs_testing (
eventversion STRING,
useridentity STRUCT<
... ...
PARTITIONED BY (region string, year string, month string, day string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://CloudTrail_bucket_name/AWSLogs/Account_ID/CloudTrail/';
Step 2: Run a Glue crawler to add the partitions to the table
Please consider that this crawler should be created differently. Please find the steps below to create and run the Glue crawler.
a) Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/. Choose Crawlers in the navigation pane.
b) Choose Add crawler, provide 'Crawler name' and click on next.
c) Under 'Crawler source type', select 'Existing catalog tables' and click on next.
d) Under 'Available tables', click on 'Add' button which is adjacent to the table which was created in Athena i.e. 'cloudtrail_logs_testing' and click on next.
e) Select 'Choose an existing IAM role' and select your IAM role from the drop down list of 'IAM role' and click on next.
f) Under 'Frequency', choose as per your requirement and click on next. I have chosen 'Run on demand'.
g) On 'Configure the crawler's output' page, under 'Configuration options (optional)' select the following options for each section.
- Under 'When the crawler detects schema changes in the data store, how should AWS Glue handle table updates in the data catalog section:
select 'Ignore the change and don't update the table in the data catalog' option and check the box for 'Update all new and existing partitions with metadata from the table' option.
- Under 'How should AWS Glue handle deleted objects in the data store?' section, select 'Ignore the change and don't update the table in the data catalog' and click on next
h) Review all the steps and click on 'Finish' and Run the crawler to add all partition automatically.
Once crawler executed, You can query the data from Athena.
Relevant content
- AWS OFFICIALUpdated 3 years ago
- AWS OFFICIALUpdated a year ago
- AWS OFFICIALUpdated 3 years ago
- AWS OFFICIALUpdated 2 months ago