For your workload, I would recommend using the real-time inference option with SageMaker endpoints.
With client-side batching, as you are doing now, the order of inputs and outputs is preserved because each batch request is synchronous: the predictions in a response correspond, in order, to the records sent in that request. Adding more instances behind the endpoint does not change this behavior.
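As a minimal sketch, assuming a deployed endpoint named my-endpoint that accepts CSV input (the endpoint name, content type, and payload format are placeholders, not details from your setup), a synchronous invocation of one client-side batch could look like this:

```python
import boto3

# SageMaker runtime client for invoking an existing real-time endpoint
runtime = boto3.client("sagemaker-runtime")

# Hypothetical batch of records, serialized as CSV (one record per line)
records = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
payload = "\n".join(",".join(str(v) for v in row) for row in records)

# Synchronous call: the response body contains predictions in the same
# order as the records in this request, which is why ordering is preserved.
response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",   # assumed endpoint name
    ContentType="text/csv",       # assumed input format
    Body=payload,
)
predictions = response["Body"].read().decode("utf-8")
print(predictions)
```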
Compared to the batch transform option, real-time inference avoids the 4-6 minute startup time needed to provision a batch transform job. Since your jobs arrive infrequently, that provisioning delay would negate any benefit from spreading a job across multiple instances.
You can provision the endpoint in advance with the instance type and number of instances sized to your throughput needs. That way, when a prediction job arrives, it is processed immediately instead of waiting for resources to be provisioned.
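A rough sketch of pre-provisioning the endpoint might look like the following; the model name, endpoint names, instance type, and instance count are placeholders to adapt to your model and throughput:

```python
import boto3

sm = boto3.client("sagemaker")

# Endpoint configuration: instance type and count sized for expected throughput.
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",   # placeholder name
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-model",            # an already-created SageMaker model
            "InstanceType": "ml.m5.xlarge",     # choose based on model size and latency needs
            "InitialInstanceCount": 1,          # start with one; scale out later if needed
        }
    ],
)

# Create the endpoint in advance so prediction jobs are served immediately.
sm.create_endpoint(
    EndpointName="my-endpoint",
    EndpointConfigName="my-endpoint-config",
)

# Optionally block until the endpoint is InService before sending traffic.
sm.get_waiter("endpoint_in_service").wait(EndpointName="my-endpoint")
```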
You are billed for the time the endpoint is in service, so you can delete the endpoint when it is not in use to avoid paying for idle instances.
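To stop the per-hour instance charges between jobs, one option (using the same placeholder names as above) is to delete the endpoint after a job completes and recreate it from the saved endpoint config before the next one:

```python
import boto3

sm = boto3.client("sagemaker")

# Deleting the endpoint stops instance billing; the model and endpoint config
# are kept, so the endpoint can be recreated quickly from them later.
sm.delete_endpoint(EndpointName="my-endpoint")
```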
For throughput needs above what a single instance can handle, you can scale out the endpoint to multiple instances as traffic increases.
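If one instance is not enough, a sketch of target-tracking auto scaling on the variant's invocations-per-instance metric could look like this; the capacity limits and target value are assumptions to tune for your traffic:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Resource ID format: endpoint/<endpoint-name>/variant/<variant-name>
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# Register the endpoint variant as a scalable target with min/max instance counts.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,   # assumed ceiling; adjust to your peak throughput
)

# Scale out/in to keep average invocations per instance near the target value.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # assumed invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```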