AWS Glue for Multi-Tenant Pipeline


Hi,

I am searching for a transformation engine that supports multi-tenancy with the following requirements:

  • Each tenant must be transformed every 10 minutes.
  • Each tenant transformation processes a data increment ranging from 10 MB to 100 MB.
  • Sometimes it is necessary to re-transform all historical data increments of all tenants, which can range from 1GB to 1TB of data per tenant.
  • A failure in the transformation of one tenant must not affect the transformations of other tenants.

When considering a transformation engine for a single customer or a single company's pipeline, Glue is often presented as a potential option. However, is Glue a suitable transformation tool for these requirements? If so, which Glue engine is best suited for the multi-tenancy task, and why?

Thank you for your help, Jaroslav.


Hi, Jaroslav. It would be nice to have more details on your requirements, like:

  • Where is the raw data stored?
  • Where will the processed data be stored?
  • In which format will the processed data be stored?
  • Will you need to use any open table format frameworks (Hudi, Iceberg, or Delta)?

Given the information provided, the short answer is yes, as long as either (a) the number of tenants is lower than the maximum number of concurrent job runs per job and the processing script is the same for every tenant, or (b) the number of tenants is lower than the maximum number of jobs per account, if the script differs per tenant (and you don't have many other jobs). The first scenario will require you to build your own scheduling logic, for example a Lambda function that invokes the job through the Glue API, passing something like a tenant ID and perhaps the file to process for each run (see the boto3 documentation for the start_job_run API call).
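As a minimal sketch of that Lambda scheduling logic (the job name, tenant list, bucket names, and argument names below are placeholders I made up, not anything Glue prescribes):

```python
import boto3

glue = boto3.client("glue")

# Tenant list is hard-coded for illustration; in practice you would read it
# from a config store such as DynamoDB or Parameter Store.
TENANTS = ["tenant-a", "tenant-b", "tenant-c"]
JOB_NAME = "multi-tenant-transform"  # hypothetical Glue job name


def handler(event, context):
    """Triggered every 10 minutes (e.g. by an EventBridge schedule).

    Starts one Glue job run per tenant, so a failure in one run
    cannot affect the runs of the other tenants.
    """
    run_ids = {}
    for tenant_id in TENANTS:
        response = glue.start_job_run(
            JobName=JOB_NAME,
            Arguments={
                "--tenant_id": tenant_id,
                # Optionally point each run at that tenant's data increment
                "--input_path": f"s3://raw-bucket/{tenant_id}/incoming/",
            },
        )
        run_ids[tenant_id] = response["JobRunId"]
    return run_ids
```

One job definition with many concurrent runs keeps you under the jobs-per-account quota while still isolating tenants at the run level.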

You may also need one script for incremental processing and another for historical processing, depending on how your raw data is organized at the source.

If there's a strict requirement to use Spark, then your only available option is the Spark job type, which may be a bit of overkill for your incremental data, given that the minimum capacity for this kind of job is 2 worker nodes with 4 vCPUs and 16 GB of memory each (for a total of 2 DPUs), priced at $0.88/hour [1]. Either way, you have two options: start with the visual editor to help you get going and then switch to script editing to customize your code, or start directly in the script editor (the first and third options in the image below):

[Image: AWS Glue Studio job creation options]
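For reference, the script for such a Glue Spark job could look roughly like this (the Parquet format, bucket names, and argument names are assumptions on my side, matching the arguments passed by the Lambda sketch above):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve the job arguments passed via start_job_run (names are illustrative)
args = getResolvedOptions(sys.argv, ["JOB_NAME", "tenant_id", "input_path"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the tenant's raw data increment from S3
df = spark.read.parquet(args["input_path"])

# ... per-tenant transformations go here ...

# Write to a per-tenant prefix so each tenant's output stays isolated
df.write.mode("append").parquet(f"s3://processed-bucket/{args['tenant_id']}/")

job.commit()
```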

Now, if you don't have a strict Spark requirement, you can use a Python Shell job to write "pure" Python code (you can use libraries like pandas, which leverage compiled C code under the hood). Python Shell jobs let you choose either 1/16 DPU or 1 DPU per job, which is more than enough for your incremental data. You can then have a separate historical-processing job on Spark or Ray, sized according to the amount of data to be processed.
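A rough sketch of such a Python Shell job, assuming it is created with the analytics library set (which bundles awswrangler) and using the same illustrative argument names as above:

```python
import sys

import awswrangler as wr
from awsglue.utils import getResolvedOptions

# Job arguments passed via start_job_run (names are placeholders)
args = getResolvedOptions(sys.argv, ["tenant_id", "input_path", "output_path"])

# Load the 10-100 MB increment into a pandas DataFrame
df = wr.s3.read_parquet(args["input_path"])

# ... apply the per-tenant transformations here ...

# Write the result back to a per-tenant prefix
wr.s3.to_parquet(
    df,
    path=f"{args['output_path']}/{args['tenant_id']}/",
    dataset=True,
)
```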

There's no one-size-fits-all approach; it will all depend on your full set of requirements, including cost constraints. The way you model your raw data storage layer will also affect your final design. Multi-tenant solutions are usually challenging, in my opinion.

Hope this answer gives you something to start with. If you can answer the questions stated at the beginning, I can narrow down your options a little.

Best regards,

Arthur.

[1] Glue Pricing

  • This has been incredibly informative. I appreciate the thoroughness and the suggestions provided. Thank you.
