AWS S3 `getObject` in Java is slower than Python's


I've noticed that `getObject` in the AWS SDK for Java v2 is slower than `get_object` in the boto3 library for Python.

Setup: I am reading a 1 GB file stored in an S3 bucket in us-east-1. I use an i3en.2xlarge EC2 instance in a VPC with a gateway endpoint for S3. The VM is located in us-east-1 too. I used the same EC2 VM to run both the Python and Java benchmarks, so I think I can rule out deployment or setup problems. OS: Ubuntu 22.04

Java 11, AWS SDK v2.29.25
Python 3.10.12, boto3 v1.35.54

Java code:

import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

import java.time.Duration;
import java.time.Instant;

public class TestS3 {

    public static void main(String[] args) {
        String bucket = args[0];
        String key = args[1];

        Instant t0 = Instant.now();
        S3Client client = S3Client.builder().build();
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .build();

        // Download the whole object into memory.
        ResponseBytes<GetObjectResponse> responseBytes = client.getObjectAsBytes(request);
        byte[] data = responseBytes.asByteArray();
        Instant t1 = Instant.now();
        int size = data.length;
        // Millisecond resolution: Duration.toSeconds() would truncate to whole seconds.
        double seconds = Duration.between(t0, t1).toMillis() / 1000.0;
        double throughput = size / seconds;

        System.out.println("Time to read: " + seconds + " s");
        System.out.println("Size: " + size + " B");
        System.out.println("Size: " + size / (1024.0 * 1024.0) + " MB");
        System.out.println("Throughput: " + throughput + " B/s");
        System.out.println("Throughput: " + throughput / (1024.0 * 1024.0) + " MB/s");
    }
}
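
One detail worth checking when timing with `java.time.Duration`: `toSeconds()` returns whole seconds, so dividing by it rounds the elapsed time down and inflates the throughput. A minimal sketch of the difference, using the elapsed time of Java run 1 (the class and method names are illustrative, not part of the original benchmark):

```java
import java.time.Duration;

final class ThroughputCalc {

    // Bytes per second, computed from millisecond resolution
    // rather than from whole seconds.
    static double bytesPerSecond(long sizeBytes, Duration elapsed) {
        double seconds = elapsed.toMillis() / 1000.0;
        return sizeBytes / seconds;
    }

    public static void main(String[] args) {
        Duration elapsed = Duration.ofMillis(18_369); // Java run 1
        long size = 1_073_741_824L;                    // object size in bytes

        System.out.println("Whole seconds:      " + elapsed.toSeconds());
        System.out.println("Fractional seconds: " + elapsed.toMillis() / 1000.0);
        System.out.println("Throughput: "
                + bytesPerSecond(size, elapsed) / (1024.0 * 1024.0) + " MB/s");
    }
}
```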

Python code:

import sys
import time
import boto3

bucket = sys.argv[1]
key = sys.argv[2]

t0 = time.time()
client = boto3.client('s3')
obj = client.get_object(Bucket=bucket, Key=key)
data = obj['Body'].read()
t1 = time.time()

size = len(data)
throughput = size / (t1 - t0)
print(f"Time to read: {t1 - t0} s")
print(f"Size: {size} B")
print(f"Size: {size / (1024 * 1024)} MB")
print(f"Throughput: {throughput} B/s")
print(f"Throughput: {throughput / (1024 * 1024)} MB/s")

Java results (3 runs):

# Run 1
Time to read: 18.369 s
Size: 1073741824 B
Size: 1024.0 MB
Throughput: 5.845401622298437E7 B/s
Throughput: 55.74609396265447 MB/s

# Run 2
Time to read: 13.227 s
Size: 1073741824 B
Size: 1024.0 MB
Throughput: 8.117803160202615E7 B/s
Throughput: 77.41740379526725 MB/s

# Run 3
Time to read: 15.643 s
Size: 1073741824 B
Size: 1024.0 MB
Throughput: 6.86404029917535E7 B/s
Throughput: 65.46058940101004 MB/s

Python results (3 runs):

# Run 1
Time to read: 13.131076335906982 s
Size: 1073741824 B
Size: 1024.0 MB
Throughput: 81771044.24135046 B/s
Throughput: 77.98294471869512 MB/s

# Run 2
Time to read: 10.911798000335693 s
Size: 1073741824 B
Size: 1024.0 MB
Throughput: 98401915.42832512 B/s
Throughput: 93.84337942917358 MB/s

# Run 3
Time to read: 10.976710796356201 s
Size: 1073741824 B
Size: 1024.0 MB
Throughput: 97819997.62227829 B/s
Throughput: 93.28841936328725 MB/s

Comments: There is a performance gap of roughly 20 to 30 MB/s between the two. The Python results match those of the AWS CLI configured with max_concurrent_requests = 1, which makes sense because, as far as I know, the CLI is implemented in Python.
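
The gap can be estimated directly from the reported timings. A minimal sketch that averages the per-run throughput for each language, using the elapsed times listed above and the 1024 MB object size (class and method names are illustrative):

```java
import java.util.Arrays;

final class ThroughputSummary {

    static final double SIZE_MB = 1024.0; // 1073741824 B

    // Mean of the per-run throughputs (MB/s) for the given elapsed times in seconds.
    static double meanThroughputMBps(double[] elapsedSeconds) {
        return Arrays.stream(elapsedSeconds)
                .map(t -> SIZE_MB / t)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[] javaRuns = {18.369, 13.227, 15.643};
        double[] pythonRuns = {13.131076335906982, 10.911798000335693, 10.976710796356201};

        double javaMean = meanThroughputMBps(javaRuns);
        double pythonMean = meanThroughputMBps(pythonRuns);

        System.out.printf("Java mean:   %.1f MB/s%n", javaMean);
        System.out.printf("Python mean: %.1f MB/s%n", pythonMean);
        System.out.printf("Gap:         %.1f MB/s%n", pythonMean - javaMean);
    }
}
```

With only three runs per language and a noticeable spread between runs, the averages are indicative rather than conclusive.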

asked 3 months ago, 87 views
1 Answer

The performance difference you're observing between Java and Python when using the getObject method to retrieve data from Amazon S3 can be attributed to several factors.

First, it's important to note that the AWS SDK for Java v2 and the boto3 library for Python may have different default configurations and optimizations. The boto3 library, being closely tied to the AWS CLI, might have more optimized default settings for S3 operations.

However, there is a development that is relevant here. AWS has announced updates to the AWS CLI and the AWS SDK for Python (Boto3) that integrate with the AWS Common Runtime (CRT) S3 client, which is designed to deliver high-throughput data transfer to and from Amazon S3. The AWS SDK for Java v2 also offers a CRT-based S3 client, so the same optimizations are available to your Java code.

The CRT S3 client implements several performance optimizations automatically, including:

  1. Request parallelization
  2. Request timeouts and retries
  3. Connection reuse and management

These optimizations can lead to significant performance improvements, with reported speedups of 2-6x across various benchmarks.
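
To illustrate what request parallelization means in practice, here is a minimal sketch that splits an object into byte ranges, fetches them concurrently, and reassembles them in order. It is not the CRT's implementation. `RangeFetcher` is a hypothetical hook standing in for a ranged GET (an HTTP request with a `Range: bytes=start-end` header), and `main` uses an in-memory array instead of S3 so the sketch runs without AWS access:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelRangeDownload {

    // Hypothetical hook: returns bytes [start, end] (inclusive) of the object.
    interface RangeFetcher {
        byte[] fetch(long start, long end) throws Exception;
    }

    // Splits [0, totalSize) into inclusive {start, end} pairs of at most partSize bytes.
    static List<long[]> splitRanges(long totalSize, long partSize) {
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < totalSize; start += partSize) {
            long end = Math.min(start + partSize, totalSize) - 1;
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }

    // Fetches all ranges on a fixed-size thread pool and copies them back in order.
    static byte[] download(RangeFetcher fetcher, int totalSize, int partSize, int parallelism)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<byte[]>> parts = new ArrayList<>();
            for (long[] r : splitRanges(totalSize, partSize)) {
                parts.add(pool.submit(() -> fetcher.fetch(r[0], r[1])));
            }
            byte[] result = new byte[totalSize];
            int offset = 0;
            for (Future<byte[]> part : parts) {
                byte[] chunk = part.get(); // waits for this part to finish
                System.arraycopy(chunk, 0, result, offset, chunk.length);
                offset += chunk.length;
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-in for the remote object.
        byte[] object = new byte[1_000_003];
        for (int i = 0; i < object.length; i++) {
            object[i] = (byte) i;
        }
        RangeFetcher fake = (start, end) ->
                Arrays.copyOfRange(object, (int) start, (int) end + 1);

        byte[] copy = download(fake, object.length, 64 * 1024, 8);
        System.out.println("Reassembled correctly: " + Arrays.equals(object, copy));
    }
}
```

The benchmark in the question issues a single request per download, which is the main difference from this pattern.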

In the AWS CLI and Boto3, this integration is currently enabled by default only on specific EC2 instance types (Trn1, P4d, and P5). On other instance types, including the i3en.2xlarge you're using, it is available as an opt-in feature.

To potentially improve the performance of your Java code, you could try the following:

  1. Ensure you're using the latest version of the AWS SDK for Java v2.
  2. Consider using the CRT-based S3 client in your Java code (`S3AsyncClient.crtBuilder()`). This requires adding the AWS CRT library (`software.amazon.awssdk.crt:aws-crt`) as a dependency to your project.
  3. Experiment with different configuration settings, such as increasing the number of concurrent requests or adjusting the chunk size for multipart uploads/downloads.
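
A minimal sketch of the CRT-based client for the same download, assuming the `aws-crt` dependency is on the classpath. The class name `TestS3Crt` and the tuning values are illustrative assumptions, not recommendations, and should be benchmarked against your own workload:

```java
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.core.async.AsyncResponseTransformer;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public class TestS3Crt {

    public static void main(String[] args) {
        String bucket = args[0];
        String key = args[1];

        // CRT-based async client. The throughput target and part size
        // below are assumed values chosen only for illustration.
        try (S3AsyncClient client = S3AsyncClient.crtBuilder()
                .targetThroughputInGbps(10.0)
                .minimumPartSizeInBytes(8L * 1024 * 1024)
                .build()) {

            GetObjectRequest request = GetObjectRequest.builder()
                    .bucket(bucket)
                    .key(key)
                    .build();

            long t0 = System.nanoTime();
            // join() blocks until the whole object has been downloaded into memory.
            ResponseBytes<GetObjectResponse> response =
                    client.getObject(request, AsyncResponseTransformer.toBytes()).join();
            double seconds = (System.nanoTime() - t0) / 1e9;

            long size = response.asByteArray().length;
            System.out.println("Time to read: " + seconds + " s");
            System.out.println("Throughput: " + size / seconds / (1024.0 * 1024.0) + " MB/s");
        }
    }
}
```

Unlike the original benchmark, this sketch excludes client construction from the timed region, so the two numbers are not directly comparable unless the timing is aligned.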

It's worth noting that the performance characteristics can vary depending on factors such as network conditions, S3 bucket location, and the specific workload. Always benchmark with your specific use case to determine the best configuration.

If after trying these optimizations you still see a significant performance gap, you might want to consider reaching out to AWS support for more detailed guidance, as there could be other factors at play specific to your setup or use case.

Sources:
Accelerate Amazon S3 throughput with the AWS Common Runtime | AWS Storage Blog

answered 3 months ago
