Hi, Jaroslav. It would be nice to have more details on your requirements, like:
- Where is the raw data stored?
- Where will the processed data be stored?
- In which format will the processed data be stored?
- Will you need any open table format frameworks, such as Hudi, Iceberg, or Delta?
Given the information provided, the short answer is yes, as long as either: the number of tenants is lower than the maximum number of concurrent runs per job and the processing script is the same for every tenant, OR the number of tenants is lower than the maximum number of jobs per account, if the scripts differ per tenant (and you don't have many other jobs). The first scenario will require you to implement your own scheduling logic, using a Lambda function that invokes the job through the Glue API while passing something like a tenant id, and perhaps the file to process, for each run (see the start_job_run API call documentation for boto3 in Python).
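To illustrate the first scenario, here is a minimal Lambda sketch that starts one run of a shared Glue job per tenant via boto3's `start_job_run`. The job name, argument keys, bucket name, and event shape are all assumptions for illustration, not something from your setup:

```python
import json

def build_job_args(tenant_id, raw_bucket="raw-bucket"):
    # Bucket name and argument keys are illustrative assumptions;
    # Glue passes these to the script as --tenant_id / --input_path.
    return {
        "--tenant_id": tenant_id,
        "--input_path": f"s3://{raw_bucket}/{tenant_id}/",
    }

def lambda_handler(event, context):
    # boto3 is preinstalled in the Lambda runtime.
    import boto3
    glue = boto3.client("glue")
    run_ids = []
    for tenant_id in event.get("tenants", []):
        # One concurrent run per tenant of the same shared job.
        resp = glue.start_job_run(
            JobName="tenant-etl-job",  # assumed job name
            Arguments=build_job_args(tenant_id),
        )
        run_ids.append(resp["JobRunId"])
    return {"statusCode": 200, "body": json.dumps(run_ids)}
```

Remember to raise the job's "Maximum concurrency" setting above its default of 1, otherwise the second `start_job_run` call will fail with a ConcurrentRunsExceededException.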
You may also need one script for incremental processing and another for historical processing, depending on how your raw data is organized at the source.
If there's a strict requirement to use Spark, then your only available option is the "Spark Job", which may be a bit of overkill for your incremental data, given that the minimum capacity for this kind of job is 2 worker nodes of 4 vCPU and 16 GB of memory each (for a total of 2 DPU), priced at $0.88/hour [1]. Either way, you have two options: start with the visual editor to get going and then switch to script editing to customize your code, or start directly in the script editor.
Now, if you don't have a strict Spark requirement, you can use the Python Shell job to write "pure" Python code (you can use libraries like pandas that leverage compiled C code). Python Shell jobs let you choose either 1/16 DPU or 1 DPU per job, which is more than enough for your incremental data. You can then have a separate historical job on Spark or Ray, sized according to the amount of data to be processed.
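As a rough sketch of what an incremental Python Shell job might look like: it keeps a watermark (the last processed timestamp) and only processes newer records. The argument names, the watermark approach, and the hard-coded sample rows are my assumptions to keep the sketch self-contained:

```python
from datetime import datetime

def filter_incremental(records, watermark):
    """Keep only records updated after the last processed timestamp."""
    return [r for r in records if r["updated_at"] > watermark]

if __name__ == "__main__":
    # In a real Python Shell job you would read --tenant_id and the
    # watermark from the job arguments and load the data from S3
    # (e.g. with pandas); hard-coded rows stand in for that here.
    rows = [
        {"id": 1, "updated_at": datetime(2024, 1, 1)},
        {"id": 2, "updated_at": datetime(2024, 3, 1)},
    ]
    new_rows = filter_incremental(rows, datetime(2024, 2, 1))
    print([r["id"] for r in new_rows])  # → [2]
```

The same filtering logic could later be ported to the Spark or Ray historical job, so both paths share one definition of "new data".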
There's no one-size-fits-all approach; it will all depend on your full set of requirements, including cost constraints. The way you model your raw data storage layer will also affect your final design. Multi-tenant solutions are usually challenging, in my opinion.
Hope this answer gives you something to start with. If you can answer the questions stated at the beginning, I can narrow down your options a little.
Best regards,
Arthur.
[1] Glue Pricing
This has been incredibly informative. I appreciate the thoroughness and the suggestions provided. Thank you.