AWS has multiple options for this kind of workload. Prescribing a specific solution is hard without all the details about producers/consumers and other requirements, but I will try to shed some light on a few options.
S3 is well suited to be a data lake. You keep raw data there for processing elsewhere. Usually, ETL jobs spin up, download data from S3, process it, and save the results in another datastore.
This second datastore will be the data warehouse (DW), holding data that has been processed and has business value. From there it should be easier to run analytics jobs, because DW solutions (like Redshift) are optimized for that kind of workload.
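To make the S3 → ETL → data warehouse flow above concrete, here is a minimal sketch of the extract/transform steps in Python with boto3. The bucket, key, and field names are hypothetical, and the `transform` function is a toy stand-in for real business logic:

```python
import json


def transform(record: dict) -> dict:
    """Toy transform step: derive a business-value field from raw data.
    A real ETL would do cleaning, joins, aggregations, etc."""
    return {"id": record["id"], "total": record["qty"] * record["price"]}


def etl(bucket: str, key: str) -> list[dict]:
    """Download one raw JSON-lines object from S3 and transform each row.
    (bucket/key are placeholders; requires AWS credentials to actually run.)"""
    import boto3  # AWS SDK for Python

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return [transform(json.loads(line)) for line in body.splitlines()]
```

For the load step into Redshift, the usual pattern is a `COPY` statement that points back at the transformed files in S3, rather than row-by-row inserts.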
As for speed, it depends on a bunch of factors.
- Is your data spread in multiple files where you could process them in parallel?
- Can you optimize the code?
- Are you hitting CPU/memory/IO limits?
- Is the download time (from S3) acceptable?
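On the first point (processing multiple files in parallel), S3 downloads are I/O-bound, so even a simple thread pool usually helps. A minimal sketch, where `process_key` is a local stand-in for "download one object and process it" (a real version would call boto3's `get_object` inside it):

```python
from concurrent.futures import ThreadPoolExecutor


def process_key(key: str) -> int:
    """Stand-in for downloading and processing one S3 object.
    Here it just does toy work so the pattern is runnable locally."""
    return len(key)


def process_all(keys: list[str], workers: int = 8) -> list[int]:
    """Fan per-file work out across threads; for I/O-bound S3 downloads,
    threads help even under CPython's GIL."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_key, keys))
```

If the per-file work is CPU-bound instead, swap in `ProcessPoolExecutor`, or move the whole job to something like AWS Glue/EMR that parallelizes across machines.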
Sorry for not having a more prescriptive answer, but I hope that helps you a little bit.
Thanks for responding. Are you suggesting that the performance will be better if I move this data from EBS to S3?
No, that wasn't my intention. To improve performance, you really should identify the bottleneck first. It could be CPU, memory, IO performance, or even the EBS bandwidth. Performance is also not always tied to infrastructure, so having some visibility into the ETL itself can give you clues as well.
S3 is suited for data retention, but the ETL will have to download data from there before being able to process it. The data is usually saved to an EBS disk and later loaded into memory for processing, but it depends on the ETL.
So you can see that both EBS and S3 will be part of the whole process. The difference is that you should avoid using EBS for long-term data retention; for better storage performance during processing, you can consider provisioned IOPS EBS volumes.
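If you go the provisioned IOPS route, the volume is created with `VolumeType` `io2` (or `io1`) and an explicit `Iops` value. A small helper sketch building the kwargs you would pass to boto3's `ec2_client.create_volume()`; the size, IOPS, and availability zone below are illustrative only:

```python
def io2_volume_params(size_gib: int, iops: int, az: str) -> dict:
    """Build kwargs for ec2_client.create_volume() for a Provisioned
    IOPS (io2) volume; io2/io1 require Iops to be set explicitly."""
    return {
        "Size": size_gib,
        "Iops": iops,
        "VolumeType": "io2",
        "AvailabilityZone": az,
    }


# usage (needs AWS credentials; values are examples):
#   import boto3
#   boto3.client("ec2").create_volume(**io2_volume_params(100, 4000, "us-east-1a"))
```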
Actually, I am referring to the file system here.