Hi, yellow submarine!
Thanks for your thoughtful question about using AWS services as a custom data source in Security Lake. Let’s break it down step by step, ensuring we’re clear on the process and concepts.
Clarifying the Issue
You’re asking if AWS services can act as a custom data source in Security Lake, alongside third-party services. The key here is understanding OCSF compliance and data formatting requirements.
Security Lake accepts data from both AWS-native and third-party services, but the data must meet specific requirements:
- It must conform to the OCSF (Open Cybersecurity Schema Framework) for consistent organization and querying. This step applies to both AWS and third-party sources.
- It must be stored in Parquet format, a high-performance columnar storage format for big data. Parquet isn't exclusive to third-party services; it's a universal requirement.
With these two requirements in mind, you're correct that transforming your data into OCSF-compliant Parquet format, partitioning it, and storing it under the `ext/` prefix will enable Security Lake to integrate it seamlessly. Now, let's explore these components in more detail.
Key Terms
- OCSF (Open Cybersecurity Schema Framework): A universal schema designed to standardize security data across diverse tools and services. Security Lake uses OCSF to normalize data from multiple sources, ensuring compatibility and simplifying analysis. This is critical for third-party integrations.
- Parquet Format: A columnar storage format used for efficient compression and querying. It’s required for all data sources (AWS or third-party) in Security Lake to optimize performance during analysis.
- AWS Glue: A fully managed service for data cataloging and ETL (Extract, Transform, Load). It’s central to transforming raw data into the formats and structures needed by Security Lake.
- Security Lake: An AWS service that centralizes and organizes security data into a data lake. It supports both AWS-native services (like CloudTrail and GuardDuty) and third-party sources, as long as the data meets the required OCSF and Parquet standards.
The Solution (Our Recipe)
1. Transform Data into OCSF Format: For AWS-native logs (e.g., CloudTrail), minimal transformation may be needed, as these services are already aligned with OCSF. For third-party services, use AWS Glue or Lambda to map and convert logs into OCSF-compliant JSON.
2. Convert to Parquet: Use AWS Glue jobs to transform the OCSF JSON data into Parquet format. This step optimizes storage and querying; Parquet's columnar nature is particularly suited to large-scale analytics. (A PySpark sketch covering steps 2 to 4 follows the flow diagram below.)
3. Partition the Data: Organize data by logical keys (e.g., `year`, `month`, `day`, or `event_type`). This improves query efficiency and cost-effectiveness. Glue ETL jobs can partition data during the transformation process.
4. Place Data Under the `ext/` Prefix: Save the partitioned Parquet data under the `ext/` prefix of the Security Lake S3 bucket. Example structure:
   `s3://<your-security-lake-bucket>/ext/year=2024/month=12/day=18/`
5. Integrate with the Glue Data Catalog: Configure an AWS Glue Crawler to scan the `ext/` path and update the Glue Data Catalog. This step ensures tables and schemas are registered for querying.
6. Validate with Athena: Use Amazon Athena to query the data and verify the integration. Example:
   `SELECT * FROM your_table_name WHERE year = 2024 AND month = 12;`
   Ensure the query results align with your expectations.
1. Transform Data to OCSF
   |
   v
2. Convert to Parquet
   |
   v
3. Partition Data
   |
   v
4. Store Under `ext/` Prefix
   |
   v
5. Run Glue Crawler
   |
   v
6. Query and Validate with Athena
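To make steps 2 to 4 concrete, here is a minimal AWS Glue PySpark sketch. The bucket names and staging path are placeholders, and it assumes the OCSF-compliant JSON carries an ISO-8601 `event_time` field from which the partition keys are derived; adapt these assumptions to your own data.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read OCSF-compliant JSON from a staging location (path is a placeholder).
ocsf_df = spark.read.json("s3://<your-staging-bucket>/ocsf-json/")

# Derive partition keys from an assumed ISO-8601 "event_time" field;
# adjust this if your records carry time as epoch milliseconds instead.
ocsf_df = (
    ocsf_df.withColumn("event_ts", F.to_timestamp("event_time"))
    .withColumn("year", F.year("event_ts"))
    .withColumn("month", F.month("event_ts"))
    .withColumn("day", F.dayofmonth("event_ts"))
    .drop("event_ts")
)

# Write partitioned Parquet under the ext/ prefix of the Security Lake bucket.
(
    ocsf_df.write.mode("append")
    .partitionBy("year", "month", "day")
    .parquet("s3://<your-security-lake-bucket>/ext/")
)

job.commit()
```

Run as a scheduled Glue job, this keeps new log batches flowing into the partitioned `ext/` layout that the later catalog and Athena steps rely on.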
Example Use Case
Imagine you're centralizing security logs from a third-party service like Splunk. Start by exporting the logs and mapping them to the OCSF schema. Use AWS Glue to transform this data into Parquet format, partition it by time or event type, and store it under the `ext/` prefix in your Security Lake S3 bucket. Set up a Glue Crawler to register this data in the Glue Data Catalog, then use Athena to run queries and validate. This process makes your Splunk logs a fully integrated, queryable custom data source in Security Lake.
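As a rough illustration of the mapping step, here is a small Python sketch. The exported Splunk field names (`_time`, `_raw`, `severity`) and the OCSF-style output fields are simplified assumptions, not a complete OCSF class; consult the OCSF schema for the event class that matches your logs.

```python
from datetime import datetime, timezone


def to_ocsf_like(splunk_event: dict) -> dict:
    """Map one exported event (hypothetical field names) to a simplified OCSF-style dict."""
    event_time = datetime.fromtimestamp(splunk_event["_time"], tz=timezone.utc)
    return {
        # Illustrative OCSF-style envelope fields, not a complete OCSF class.
        "time": int(event_time.timestamp() * 1000),  # epoch milliseconds
        "severity_id": splunk_event.get("severity", 1),
        "message": splunk_event.get("_raw", ""),
        "metadata": {"product": {"name": "Splunk"}},
        # Partition keys consumed later by the Glue job and the ext/ layout.
        "year": event_time.year,
        "month": event_time.month,
        "day": event_time.day,
    }


if __name__ == "__main__":
    sample = {"_time": 1734518400, "_raw": "failed login for user alice", "severity": 3}
    print(to_ocsf_like(sample))
```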
Closing Thoughts
Your understanding of the process is solid! The distinction between OCSF compliance (necessary for both AWS and third-party services) and Parquet formatting (required for all data) is key to setting up a custom data source in Security Lake. Follow the steps above, and you’ll be on your way to seamless integration. If you run into challenges, AWS documentation or support is there to help.
I hope this explanation clears up any confusion and empowers you to move forward with confidence. Let me know if there’s anything else you’d like clarified. Good luck with Security Lake! 😊
Cheers, Aaron 😊
Hello, Aaron !
Thank you for your kind response and detailed explanation.
I have a question regarding your explanation. In the solution (Our Recipe), you mentioned configuring an AWS Glue Crawler to scan the ext/ path and update the Glue Data Catalog to ensure the catalogs are registered for querying. However, this video https://www.youtube.com/watch?v=8MDP3LX2A-A states that Security Lake automatically handles this step, specifically setting up the Glue Crawler to partition the data, populate the Glue Data Catalog with tables, and automatically extract source data for schema definition.
Do we need to manually create and run a crawler for the ext/ path to populate the Glue Catalog and add partitions whenever new objects are added to the specific S3 path? If not, then I assume the recipe would look like this.
1. Transform Data to OCSF
   |
   v
2. Convert to Parquet
   |
   v
3. Partition Data
   |
   v
4. Store Under `ext/` Prefix
   |
   v
5. Query and Validate with Athena
If my understanding is incorrect, could you please clarify? I'd appreciate any corrections or insights.
Looking forward to your response, Thank you!
Hi yellow submarine,
Thanks for your thoughtful follow-up question! You’re absolutely correct that AWS Security Lake automates many processes, including running Glue Crawlers to partition data, populate the Glue Data Catalog, and extract schema definitions for data stored in the ext/ prefix. For most users, this automation works seamlessly and simplifies the setup process.
In our original recipe, we included the manual Glue Crawler step because it offers a deeper understanding of what’s happening behind the scenes. While the automated setup is convenient, it operates with several underlying assumptions that may not always be clear. Manual setups ensure you have full control over the process, which can be particularly valuable in a troubleshooting context or if custom configurations are needed.
To answer your specific question: No, you do not need to manually create and run a Glue Crawler for Security Lake to populate the Glue Data Catalog. The workflow you outlined—transforming data to OCSF, converting to Parquet, partitioning, storing it under the ext/ prefix, and querying with Athena—is accurate for an automated setup.
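If you want to confirm what the automated setup has already registered before deciding whether any manual step is needed, a quick boto3 check of the Glue Data Catalog can help. The database name below is a placeholder, since the name Security Lake uses varies by account and Region.

```python
import boto3

glue = boto3.client("glue")

# List catalog databases; find the one Security Lake manages in your
# account and Region (its name varies, so check the Security Lake console).
for database in glue.get_databases()["DatabaseList"]:
    print("database:", database["Name"])

# Then list the tables registered under that database (placeholder name).
database_name = "<security-lake-glue-database>"
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName=database_name):
    for table in page["TableList"]:
        print("table:", table["Name"])
```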
Thanks again for catching this distinction! Both approaches have their place, and we're glad to have the chance to discuss this.
Cheers, Aaron 😊