1 Answer
Ultimately, you can solve this in several ways, but let me start by addressing the points in your question directly:
- Glue would not be ideal for streaming the data into a Kinesis stream or a queue. Glue processes data across multiple executors and therefore consumes far more compute capacity than this task needs.
- Lambda has a 15-minute execution limit, which rules out processing a large workload in a single invocation. One option is to orchestrate multiple Lambda functions, in parallel or one after another, to process the file(s).
There are some possible solutions I can think of that you can build upon:
- When reading objects from S3, you don't have to read the whole object at once; you can use range requests to read only a part of it. To do that, specify the byte range you want when calling get_object() (see the boto3 documentation for get_object()). You could then orchestrate Lambdas from Step Functions until all the data has been read.
- If your team already has an EC2 instance running in the same region, you could run AWS CLI commands to download the files to EC2, split them, and upload the parts back to S3. Each part can then be processed by its own Lambda.
- AWS Batch is another option to consider: it can spin up an EC2 instance for just the time needed to run your job, avoiding all the splitting and orchestration. Your code can be in the language of your choice, reading the files and writing the records into queues or Kinesis.
- Alternatively, an EC2 instance in the same region could run all the logic itself: read from S3 and write into a queue or Kinesis stream. This likewise avoids the splitting and orchestration.
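The range-request approach in the first bullet can be sketched as follows. This is a minimal sketch, not a production implementation: the chunk size is arbitrary, the bucket and key are placeholders you would replace, and the boto3 calls require AWS credentials at runtime.

```python
def byte_ranges(total_size, chunk_size):
    """Yield inclusive (start, end) byte ranges covering total_size bytes."""
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield start, end
        start = end + 1

def read_object_in_chunks(bucket, key, chunk_size=8 * 1024 * 1024):
    """Stream an S3 object piece by piece using ranged GETs.

    bucket/key are placeholders; this needs AWS credentials to run.
    """
    import boto3  # imported lazily so byte_ranges() stays testable offline
    s3 = boto3.client("s3")
    # head_object tells us the object size without downloading it
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    for start, end in byte_ranges(size, chunk_size):
        part = s3.get_object(Bucket=bucket, Key=key,
                             Range=f"bytes={start}-{end}")
        yield part["Body"].read()
```

In a Step Functions setup, each Lambda invocation would handle one (start, end) pair instead of looping, with the state machine passing the next offset along until the object is exhausted.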
answered 2 years ago