AWS Glue for Multi-Tenant Pipeline


Hi,

I am searching for a transformation engine that supports multi-tenancy with the following requirements:

  • Each tenant must be transformed every 10 minutes.
  • Each tenant transformation processes a data increment ranging from 10 MB to 100 MB.
  • Sometimes it is necessary to re-transform all historical data increments of all tenants, which can range from 1GB to 1TB of data per tenant.
  • A failure in the transformation of one tenant must not affect the transformations of other tenants.

When considering a transformation engine for a single customer or a single company's pipeline, Glue is often presented as a potential option. However, is Glue a suitable transformation tool for these requirements? If so, which Glue engine is best suited for the multi-tenancy task, and why?

Thank you for your help, Jaroslav.


Hi, Jaroslav. It would be nice to have more details on your requirements, like:

  • Where is the raw data stored?
  • Where will the processed data be stored?
  • In which format will the processed data be stored?
  • Will you need to use any open table format frameworks (Hudi, Iceberg, or Delta)?

Given the information provided, the short answer is yes, as long as either (a) the number of tenants is lower than the maximum number of concurrent job runs per job and the processing script is the same for every tenant, or (b) the number of tenants is lower than the maximum number of jobs per account, if the script differs per tenant (and you don't have many other jobs). The first scenario will require you to build your own scheduling logic, for example a Lambda function that invokes the job through the Glue API, passing something like a tenant ID and perhaps the file to process for each run (see the boto3 documentation for the start_job_run API call).
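As a minimal sketch of that Lambda scheduling logic (the job name, tenant list, bucket names, and argument names below are placeholders I made up, not anything Glue prescribes):

```python
import boto3

glue = boto3.client("glue")

# Tenant list is hard-coded for illustration; in practice you would read it
# from a config store such as DynamoDB or Parameter Store.
TENANTS = ["tenant-a", "tenant-b", "tenant-c"]
JOB_NAME = "multi-tenant-transform"  # hypothetical Glue job name


def handler(event, context):
    """Triggered every 10 minutes (e.g. by an EventBridge schedule).

    Starts one Glue job run per tenant, so a failure in one run
    cannot affect the runs of the other tenants.
    """
    run_ids = {}
    for tenant_id in TENANTS:
        response = glue.start_job_run(
            JobName=JOB_NAME,
            Arguments={
                "--tenant_id": tenant_id,
                # Optionally point each run at that tenant's data increment
                "--input_path": f"s3://raw-bucket/{tenant_id}/incoming/",
            },
        )
        run_ids[tenant_id] = response["JobRunId"]
    return run_ids
```

One job definition with many concurrent runs keeps you under the jobs-per-account quota while still isolating tenants at the run level.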

You may also need one script for incremental processing and another for historical processing, depending on how your raw data is organized at the source.

If there's a strict requirement to use Spark, then your only available option is the Spark job type, which may be a bit of overkill for your incremental data, given that the minimum capacity for this kind of job is 2 worker nodes with 4 vCPUs and 16 GB of memory each (for a total of 2 DPUs), priced at $0.88/hour [1]. Either way, you have two options: start with the visual editor to help you get going and then switch to script editing to customize your code, or start directly in the script editor (the first and third options in the image below):

[Image: AWS Glue Studio job creation options]
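For reference, the script for such a Glue Spark job could look roughly like this (the Parquet format, bucket names, and argument names are assumptions on my side, matching the arguments passed by the Lambda sketch above):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve the job arguments passed via start_job_run (names are illustrative)
args = getResolvedOptions(sys.argv, ["JOB_NAME", "tenant_id", "input_path"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the tenant's raw data increment from S3
df = spark.read.parquet(args["input_path"])

# ... per-tenant transformations go here ...

# Write to a per-tenant prefix so each tenant's output stays isolated
df.write.mode("append").parquet(f"s3://processed-bucket/{args['tenant_id']}/")

job.commit()
```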

Now, if you don't have a strict Spark requirement, you can use a Python Shell job to write "pure" Python code (you can use libraries like pandas, which leverage compiled C code under the hood). Python Shell jobs let you choose either 1/16 DPU or 1 DPU per job, which is more than enough for your incremental data. You can then have a separate historical-processing job on Spark or Ray, sized according to the amount of data to be processed.
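A rough sketch of such a Python Shell job, assuming it is created with the analytics library set (which bundles awswrangler) and using the same illustrative argument names as above:

```python
import sys

import awswrangler as wr
from awsglue.utils import getResolvedOptions

# Job arguments passed via start_job_run (names are placeholders)
args = getResolvedOptions(sys.argv, ["tenant_id", "input_path", "output_path"])

# Load the 10-100 MB increment into a pandas DataFrame
df = wr.s3.read_parquet(args["input_path"])

# ... apply the per-tenant transformations here ...

# Write the result back to a per-tenant prefix
wr.s3.to_parquet(
    df,
    path=f"{args['output_path']}/{args['tenant_id']}/",
    dataset=True,
)
```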

There's no one-size-fits-all approach; it will all depend on your full set of requirements, including cost constraints. The way you model your raw data storage layer will also affect your final design. Multi-tenant solutions are usually challenging, in my opinion.

Hope this answer gives you something to start with. If you can answer the questions stated at the beginning, I can narrow down your options a little.

Best regards,

Arthur.

[1] Glue Pricing

  • This has been incredibly informative. I appreciate the thoroughness and the suggestions provided. Thank you.
