Building a Data Lake in AWS - Choosing the Right Approach and Services


Hello team,

We are planning to build a data lake in AWS that will contain data regularly extracted from an on-prem data warehouse. The data lake should serve the following purposes in real time:

  • Enable multiple services and lines of business to access the data.
  • Support real-time analytics.
  • Verify client identity using the on-prem data warehouse data injected into the data lake.

Our primary question is whether it is best to keep the data only in the data lake for these scenarios, or to keep it in the data lake and also copy it to an RDS database, for example for identity validation. Please note that, due to potential human errors during data entry, we cannot ensure data integrity within the RDS database (certain table constraints need to be disabled for data injection).

The data may be provided in CSV format.

Our queries are as follows:

  • Which option would be more suitable: storing the data in the data lake or moving it to RDS?
  • What AWS products and services are recommended for building a data lake that will be used by multiple services, lines of business, and real-time analytics?

Thank you in advance for your guidance.

Best regards,

2 Answers


Building a data lake is a complex topic and there usually isn't a single "best" solution. It depends a lot on your requirements, data volumes, organizational structure, security and compliance guidelines, and so on. I recommend taking a look at this blog post, which provides an architecture that incorporates many best practices and has been successfully implemented by many customers.

Please upvote/accept this answer if you found it helpful

AWS
answered 9 months ago

To create a data lake on AWS, you can consider AWS Lake Formation[1]. AWS Lake Formation enables you to set up a secure data lake. You can store your data as-is, without first having to structure it.

You can also run different types of analytics to better guide decision-making—from dashboards and visualizations, to big data processing, real-time analytics, and machine learning.

Further, Lake Formation provides its own permissions model that augments the IAM permissions model. This centrally defined permissions model enables fine-grained access to data stored in data lakes through a simple grant or revoke mechanism, much like a relational database management system (RDBMS).

Lake Formation permissions are enforced using granular controls at the column, row, and cell levels across AWS analytics and machine learning services, including Amazon Athena, Amazon QuickSight, and Amazon Redshift.
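As a rough illustration of the grant mechanism described above, the following Python sketch builds a column-level SELECT grant and passes it to the boto3 Lake Formation client's `grant_permissions` call. The role ARN, database name, table name, and column names are hypothetical placeholders, not values from your environment, and the sketch assumes boto3 is installed and credentials with Lake Formation administrator rights are configured.

```python
from typing import Any


def build_column_grant(principal_arn: str, database: str, table: str,
                       columns: list[str]) -> dict[str, Any]:
    """Build the keyword arguments for a column-level SELECT grant."""
    if not columns:
        raise ValueError("at least one column is required")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def grant_column_select(principal_arn: str, database: str, table: str,
                        columns: list[str]) -> dict[str, Any]:
    """Send the grant to Lake Formation (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request can be built without the SDK

    client = boto3.client("lakeformation")
    return client.grant_permissions(
        **build_column_grant(principal_arn, database, table, columns)
    )


# Hypothetical example: an identity-validation service may read only the
# columns it needs from a "clients" table.
request = build_column_grant(
    "arn:aws:iam::111122223333:role/identity-service",
    "onprem_dw",
    "clients",
    ["client_id", "date_of_birth"],
)
```

The same `Principal`, `Resource`, and `Permissions` structure can be passed to `revoke_permissions` to withdraw the access again.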

You can also review other AWS Services that can be integrated with Lake Formation at: [+] AWS service integrations with Lake Formation -

Given the range of service integrations that Lake Formation supports, you are not limited to using RDS and can explore other approaches to find a solution tailored to your particular use case. One such option is to integrate the AWS Glue Data Catalog and crawlers[2].
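To make the crawler option more concrete, here is a minimal Python sketch that registers a Glue crawler over an S3 prefix holding the extracted CSV files and starts it, so the tables appear in the Data Catalog. The crawler name, IAM role ARN, database name, bucket path, and schedule are all hypothetical placeholders, and the sketch assumes boto3 is installed and the role is allowed to read the bucket.

```python
from typing import Any, Optional


def build_crawler_request(name: str, role_arn: str, database: str,
                          s3_path: str,
                          schedule: Optional[str] = None) -> dict[str, Any]:
    """Build the keyword arguments for the Glue create_crawler call."""
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must start with 's3://'")
    request: dict[str, Any] = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Glue expects a cron expression, e.g. "cron(0 2 * * ? *)".
        request["Schedule"] = schedule
    return request


def create_and_start_crawler(request: dict[str, Any]) -> None:
    """Create the crawler and run it once (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request can be built without the SDK

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical example: crawl the nightly CSV extracts at 02:00 UTC.
crawler_request = build_crawler_request(
    name="onprem-dw-extracts",
    role_arn="arn:aws:iam::111122223333:role/glue-crawler-role",
    database="onprem_dw",
    s3_path="s3://example-data-lake/raw/onprem_dw/",
    schedule="cron(0 2 * * ? *)",
)
```

Once the crawler has populated the catalog, the tables can be queried from services such as Amazon Athena without first loading the data into RDS.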

For a beginner's guide, please refer to the AWS blog post "Getting started with AWS Lake Formation": [+]

Lastly, to understand your particular use case better and provide a tailored solution, we may require details that are non-public information. Please open a support case with AWS using the following link: [+]


[1] What is AWS Lake Formation? -

[2] Data Catalog and crawlers in AWS Glue -

answered 9 months ago
