Can I use AWS Glue crawlers to create master data in Delta Lake tables?

I am setting up a new data lake and have been tasked with creating the master data tables in the Databricks Delta Lake component. I'm trying to do this in a use-case-agnostic way (or as agnostic as possible), and I need to automate the process where possible. I have researched AWS Glue crawlers, and they seem to be a good way to automatically create a schema and catalog for the data.
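For illustration, here is a minimal sketch of creating and starting such a crawler with boto3. The crawler name, IAM role, catalog database, and S3 path are hypothetical placeholders, not values from my setup:

```python
import boto3

# Hypothetical names -- replace with your own account's resources.
CRAWLER_NAME = "raw-zone-crawler"
GLUE_ROLE_ARN = "arn:aws:iam::123456789012:role/GlueCrawlerRole"
CATALOG_DATABASE = "raw_zone"
RAW_BUCKET_PATH = "s3://my-data-lake/raw/"

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that scans the raw zone and infers one table per dataset.
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=GLUE_ROLE_ARN,
    DatabaseName=CATALOG_DATABASE,
    Targets={"S3Targets": [{"Path": RAW_BUCKET_PATH}]},
    # Re-crawling keeps the catalog in sync as new files land.
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
)

# Run it on demand; a cron-style Schedule parameter can be set instead.
glue.start_crawler(Name=CRAWLER_NAME)
```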

However, I'm not sure how to proceed. I'm assuming that creating the master data means identifying common fields across all the data sources, creating a schema for all the data using a single crawler, and then dividing that schema into facts and dimensions. After that, I could use Spark jobs on Databricks to extract what I need from the raw data and populate the master data, checking for duplicates and applying whatever other transformations are needed.
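As a rough sketch of that dedupe-and-populate step, the snippet below uses the Delta Lake MERGE API on Databricks (where a `spark` session is predefined). The S3 paths and the `customer_id` key column are assumptions for illustration, and the master Delta table is assumed to already exist:

```python
from delta.tables import DeltaTable

# Hypothetical paths and key column -- adjust for the actual dataset.
RAW_PATH = "s3://my-data-lake/raw/customers/"
MASTER_PATH = "s3://my-data-lake/master/dim_customer/"

# Read the raw data the crawler catalogued and drop in-batch duplicates.
raw = (
    spark.read.format("parquet").load(RAW_PATH)
    .dropDuplicates(["customer_id"])
)

# Upsert into the master dimension table so that re-runs stay idempotent.
# Assumes a Delta table already exists at MASTER_PATH.
master = DeltaTable.forPath(spark, MASTER_PATH)
(
    master.alias("m")
    .merge(raw.alias("r"), "m.customer_id = r.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```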

This plan seems to require a lot of manual labor, though, and it isn't use-case agnostic in any way. Does anyone know how it could be automated further?

Any help would be much appreciated.

1 Answer

Hello,

1) A Glue crawler can crawl the data source and create tables according to the schema it infers from the files in the datasets.
2) A Glue crawler does not validate common columns across different datasets; identifying shared fields is up to you (see the sketch after the reference below).

[1] How Glue crawlers work: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
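Since the crawler won't identify common columns for you, one option is to read the crawled schemas back from the Glue Data Catalog and intersect the column sets yourself. A minimal sketch with boto3, assuming a hypothetical database name:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Hypothetical catalog database populated by the crawler.
DATABASE = "raw_zone"

# Collect the column names of every table the crawler produced.
paginator = glue.get_paginator("get_tables")
columns_per_table = {}
for page in paginator.paginate(DatabaseName=DATABASE):
    for table in page["TableList"]:
        cols = {c["Name"] for c in table["StorageDescriptor"]["Columns"]}
        columns_per_table[table["Name"]] = cols

# Columns shared by all datasets -- candidate master/conformed fields.
common = (
    set.intersection(*columns_per_table.values())
    if columns_per_table
    else set()
)
print("Common columns:", sorted(common))
```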

AWS
answered 2 years ago
