
Disappointed in AWS EC2 compute performance:

1

I was looking to expedite some simulations using an EC2 instance. The simulations are based on a C executable whose performance is dominated by CPU performance. I was using an Apple M1 processor with 8 cores and 16 GB of memory, and was hoping to expedite the simulations using an EC2 high-performance computing resource. I chose the c8g.4xlarge instance (16 vCPUs, 32 GB memory, ARM64 Graviton4 processor). I benchmarked its simulation times against the Apple M1 processor and was rather surprised and disappointed. A comparison of the results is provided in the attached PNG file.

Comparison of CPU times: Apple M1 vs. AWS c8g.4xlarge

asked 2 months ago · 211 views
12 Answers
2

All it takes to produce performance differences of that type and scale for computationally intensive workloads is one or a small handful of operations that get mapped to instructions optimised for those operations on one processor, but either not mapped at all, or mapped to less efficiently optimised instructions, on another processor platform.

Did you compile your program with a compiler and options that support Graviton4-specific instructions and optimisations? For GCC 13, the performance-optimised compiler flag would be -mcpu=neoverse-v2. Details are explained here: https://github.com/aws/aws-graviton-getting-started/blob/main/c-c++.md#cc-on-graviton
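For illustration, a minimal way to compare the two builds might look like the sketch below. The kernel is only a stand-in for your simulation, and the -O3 level is an assumption; the -mcpu flag is the one from the guide above.

#include <math.h>
#include <stdio.h>

/*
 * Stand-in, sin-dominated kernel, compiled two ways for comparison:
 *   gcc -O3 -mcpu=neoverse-v2 -o bench_g4      bench.c -lm   (GCC 13+, Graviton4 target)
 *   gcc -O3                   -o bench_generic bench.c -lm   (generic baseline)
 */
int main(void)
{
    double sum = 0.0;
    for (long i = 0; i < 10L * 1000 * 1000; i++)
        sum += sin(1.0e-6 * (double)i);
    printf("checksum: %f\n", sum);  /* print the result so the loop isn't optimised away */
    return 0;
}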

As the bot wrote, specific operations may perform differently even without purpose-built optimisations on different processor models. If enabling the relevant compiler options for Graviton4 still fails to produce the expected performance, one of the EC2 instance types using older Graviton processor models or specific processor models from AMD or Intel might outperform Graviton4 with your specific computations, particularly when compiled specifically for those target processors.

EXPERT
answered 2 months ago
EXPERT
reviewed 2 months ago
0

Dear Leo,

Thank you again for reviewing my note and all your comments (once again!). To respond to your thoughts in our prior exchange, I did create a new EC2 c7i.4xlarge instance with a customized CPU setting of 1 thread per core to disable hyperthreading. I re-ran my benchmark using the "default" gcc compiler settings, the same as used on the c7i.4xlarge instance with hyperthreading enabled (its default, as you noted).

The updated CPU matrix (corrected per your comments) and the added distributions of elapsed times are provided as Table 2 and Figure 2.

Table 2 Figure 2

In response to your latest comments:

The variance across all tests (except the c7i) is quite large. You list std deviation as ~60 seconds for c8g, and 40 seconds for Apple M1, however the M1 has 4 results above 31 minutes, which I would expect would make std-deviation even larger than c8g. Is that calculation correct?

The statistics are computed as part of a gnuplot script that uses the original elapsed time data. I just double checked the values of the standard deviations for one of the c8g instances and the Apple M1 and found them consistent with those shown in the table. Hence, they appear correct.

In any case, a predictable, CPU-dominated benchmark such as summing sines shouldn't show nearly that much variation, is there a dynamic solver involved ? Do you have any ideas what causes execution time to vary from run to run?

The environment (i.e., other jobs, simulation settings, job submission script, and executable) is identical. I was careful to try to maintain a common set of conditions in order to assess their relative performance. There is no dynamic solver. However, the sinusoids all have random parameters (amplitudes, frequencies, initial phases, and start times). I enforce limits on the range of frequencies and amplitudes and, in some cases, iterate the random number generator to create a new random number that falls within the accepted range. Hence, there may be some variation between runs since the seed for the random number generation is set by the time stamp. However, I would not expect that amount of variation. I checked the log of the Apple M1 run whose elapsed time was the maximum and only found a few cases (out of 100,000) where it iterated on the random number. Hence, I do not believe that is the source of the added elapsed time.
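For illustration, a bounded re-draw of this kind might look roughly like the following (the limits and the plain rand() generator are placeholders, not the actual code):

#include <stdlib.h>
#include <time.h>

/* Placeholder uniform draw in [0,1); the real code may use a different generator. */
static double uniform01(void) { return (double)rand() / ((double)RAND_MAX + 1.0); }

/* Draw a frequency, iterating the generator until it falls within [f_min, f_max]. */
static double draw_frequency(double f_min, double f_max, double f_span)
{
    double f;
    do {
        f = f_span * uniform01();          /* raw draw */
    } while (f < f_min || f > f_max);      /* reject and re-draw if out of range */
    return f;
}

int main(void)
{
    srand((unsigned)time(NULL));           /* seed from the time stamp, as described above */
    double f = draw_frequency(1.0e6, 1.0e9, 2.0e9);  /* illustrative limits */
    (void)f;
    return 0;
}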

Since you are interested in core count, I assume your benchmark is multi-threaded. Have you tested the thread scalability?

The benchmark is single threaded, but for the 1000 simulations, I run 5 jobs at a time (200 jobs per submission).

Do you know if the simulation is slow to generate the sines, or sum them?

I have not measured it, but I believe the random number generation likely takes less time than creating the sum. The sum is a running sum (i.e., the sum of N sinusoids uses the prior sum of N-1 sinusoids). Random numbers are generated for the amplitude, frequency, initial phase, and start time of each sinusoid. However, I compute the sum using 2^20 time points, and hence it must add each time point. I suspect the generation of the 2^20 points (all done in memory) takes the greatest amount of time.
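As a rough sketch of that structure (not the actual code; the sample interval, sinusoid count, and parameter ranges below are placeholders), the accumulation looks something like this:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N_POINTS (1 << 20)                 /* 2^20 time points, as described */

int main(void)
{
    static double sum[N_POINTS];           /* running sum over all sinusoids */
    const double two_pi  = 6.283185307179586;
    const double dt      = 1.0e-12;        /* sample interval (placeholder) */
    const int    n_sines = 100;            /* reduced count for a quick run */

    for (int s = 0; s < n_sines; s++) {
        /* Placeholder random parameters; the real code enforces ranges
           (see the re-draw sketch earlier) and also draws a start time. */
        double ampl  = (double)rand() / RAND_MAX;
        double freq  = 1.0e9 * (double)rand() / RAND_MAX;
        double phase = two_pi * (double)rand() / RAND_MAX;

        for (int i = 0; i < N_POINTS; i++) {
            double t = (double)i * dt;
            sum[i] += ampl * sin(two_pi * freq * t + phase);
        }
    }
    printf("%f\n", sum[0]);                /* use the result so it isn't optimised away */
    return 0;
}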

Is the code simple enough to share?

There is a source, include, and plotting routines directory with a gcc makefile if you are interested. It is more than a few lines of code, but not thousands of lines. I am happy to share it as a compressed zip file if you are interested.

You mention a 10 second period and 100,000 sinusoids:

The 10 second period is the width of the histogram bins. There are 2^20 time points for each sinusoid.

I assume then that you have a 10kHz sample rate? Are you taking an FFT (FFTW? ArmPL?), or is the data already in the frequency domain?

The sample rate is actually 1e12 (sample time of 1 ps). The data is in the time domain only and there is no further processing (i.e., PSD, FFT, etc.).

so your code might be memory bandwidth bound or is generating these on the fly.

I have examined the memory usage and it is very small. CPU usage dominates the elapsed time.

Shawn

answered a month ago
  • so your code might be memory bandwidth bound or is generating these on the fly. I have examined the memory usage and it is very small. CPU usage dominates the elapsed time.

    To be clear, a code being memory-bandwidth bound will show those cores as busy, and may not have a large memory footprint. I recommend reading our guide on understanding hardware performance in the Graviton Getting Started guide: https://github.com/aws/aws-graviton-getting-started/blob/main/perfrunbook/debug_hw_perf.md

    Thanks for the updates in sample rates. Given 100K * 1MSamples, your workload seems to need roughly 100 GFLOP to sum the sinusoids (more to generate them). This should still be well under a minute of execution time. However, I did a quick test and, as I suspected, my code spends 97% of its time generating the sinusoids, not summing them.

    I am also totally at a loss for what is causing the variance in run time. The amplitude and phase of the timeseries should not affect summation time. I recommend you take a look at APerf. Record for the entire benchmark duration (perhaps relax sample period to 10 or 30 seconds to avoid gathering too much data) and see if there are any other surprises during the benchmark.

0

Dear lrbison,

I wrote up a quick test code in python:

Your dedication is impressive - thank you!

The test suggests this single-threaded, simple numpy code can perform the 100,000 sinusoids in about 23 minutes on Graviton3. It also suggests nearly all of that time is spent calling sin():

This is consistent with my comments earlier as I believed the dominant factor responsible for execution time is the computation of the sinusoidal sum and not the generation of random numbers nor the access times to the data arrays.

My recommendation is (1) think about if your actual workload needs to generate sin waves, or if this is just a benchmarking artifact,

Yes. I do need to generate sine waves. The motivation for the C code was to provide a time-efficient solution for generating sums of sinusoids. Although we have been discussing reducing the execution time further, the execution times of the compiled version on a c8g.4xlarge EC2 instance are far shorter than those of other methods (UNIX-based scripts, Octave/MATLAB, circuit simulators, etc.).

(2) test your C code using ArmPL.

I will look into this option lrbison. Thank you once again!

smlogan

answered a month ago
  • the dominant factor responsible for execution time is the computation of the sinusoidal sum

    Clarification: it's not the sum... it's all sin(). To that end (having not seen your C code), you might find that the compiler can more easily apply an efficient vectorization using ArmPL's sin function if the body of that loop is as short and simple as possible.

    I'm sure you also know this, but depending on what frequencies you are summing together, you might find it more efficient to generate your waveform in the frequency domain.
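    For illustration, the extracted inner loop might be reduced to something like this (the names are placeholders; link with -lamath -lm as suggested elsewhere in this thread to pick up ArmPL's optimised sin):

    #include <math.h>
    #include <stddef.h>

    /* Keep the loop body minimal: one multiply-add around sin() and nothing else,
       which gives the compiler the best chance to vectorise the sin call. */
    void add_sine(double *restrict sum, size_t n,
                  double ampl, double w_dt, double phase)
    {
        for (size_t i = 0; i < n; i++)
            sum[i] += ampl * sin(w_dt * (double)i + phase);
    }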

0

Dear Leo K and Oleski Bebych,

Thank you both for reading my post and providing your insights! I've included a couple of comments for your information.

For CPU-intensive simulations, you might want to consider other instance types that could potentially offer better performance:

Thank you. I did consider experimenting with other processor types (i.e., other C7x EC2 instances). I initially chose the c8g.4xlarge instance as it appeared to have slightly more CPU resources than the Apple M1 I am using.

Also, ensure that your code is optimized for the architecture you're running on. Code optimized for the M1's architecture might not perform as well on a different ARM-based processor without some adjustments. Lastly, consider factors like the compiler used, optimization flags, and the specific nature of your simulations, as these can all impact performance across different architectures.

This is a good point. I did not optimize the gcc compiler for the specific machine type, but used the same options for both the Apple M1 and Graviton processors. Thank you for the recommendation and links to added information.

I will try some of the EC2 instances you suggested and modify the compiler settings on a per machine basis. I will update the data when my experiments are complete in case you (or others) have an interest.

Thank you - once again!

smlogan

answered a month ago
0

Dear Leo K and Oleski Bebych,

I've completed the experiments proposed and compiled a new comparison of the results. A summary follows for you or for anyone else with an interest.

  1. For the EC2 instance using the Graviton4 processor, I was limited in the compiler options, as only v11 of gcc is installed by the "sudo yum install gcc" procedure. The gcc compiler option recommended for AWS Graviton4 processors is only available from gcc v13. Hence, I used the option recommended for Graviton3 processors that is supported by gcc 11, -mcpu=neoverse-v1.
  2. For EC2 based on the Intel Xeon based processor, I did not use any special compiler flags.
  3. A summary of the processors and their memory for the experiments is shown in Table 1. A graphical summary of the elapsed CPU times for each of the experiments is included in Figure 1. The Intel Xeon based EC2 instance appears to have the minimum elapsed CPU time and the minimum variance in elapsed time.

I hope this is of some interest and thank you, once again, for your time reading my post and providing your suggestions!

smlogan

Table 1 Figure 1

answered a month ago
0

Hi smlogan, the c7i.4xlarge instance type has 8 cores, not 16. It shows 16 vCPUs in the operating system, because HyperThreading is still supported by that processor model and enabled by default. Disabling HyperThreading may slightly increase performance for highly optimised and compute-intensive workloads. You can disable it by setting thread count to 1 in the CPU options of the EC2 instance. Graviton processors and newer x86-64 processors from AMD and Intel don't use HyperThreading, so the thread count is always 1 and the numbers of vCPUs and cores are the same.

From your results, it seems quite possible that no significant processor optimisations may be involved that wouldn't be implemented roughly as efficiently on all the processor models and generations. If that is so, you might get better performance simply by choosing a reasonably modern processor with a higher clock frequency in its processor class.

For example, the c7a.2xlarge is about 42% less expensive (possibly varying by region) than the c7i.4xlarge you used for testing, but it has 8 cores with a maximum sustained clock speed of 3.7 GHz for a total of 29.6 GHz. The more expensive c7i.4xlarge has 8 cores running at 3.2 GHz for a total of just 25.6 GHz. The c7a.4xlarge is about 15% more expensive than c7i.4xlarge but has 16 cores at 3.7 GHz for a total of 59.2 GHz.

It sounds likely the newer compiler may not improve performance significantly, if the operations your program performs are relatively well-established, but if you'd still like to give it a try, Ubuntu 24.04 LTS seems to have GCC 14 available, so it would be a straightforward way to test the newest compiler and Graviton optimisations.

EXPERT
answered a month ago
0

Dear Leo,

Thank you once again for your added insights!

Disabling HyperThreading may slightly increase performance for highly optimised and compute-intensive workloads. You can disable it by setting thread count to 1 in the CPU options of the EC2 instance.

I have created a new c7i.4xlarge instance that disables hyperthreading and submitted the same jobs to compare the results. I did not realize that hyperthreading was enabled by default, but I am familiar with its use.

From your results, it seems quite possible that no significant processor optimisations may be involved that wouldn't be implemented roughly as efficiently on all the processor models and generations. If that is so, you might get better performance simply by choosing a reasonably modern processor with a higher clock frequency in its processor class.

I agree with your hypothesis and believe the performance of this application is clock speed/CPU core limited. Hence, your suggestions for alternative EC2 instances are useful to know. I think I will take a "pass" on verifying the impact of compiler settings using an EC2 Ubuntu instance for now...

Thank you! smlogan

answered a month ago
0

smlogan,

Thank you for posting the question. The performance of Graviton4 is surprising to me in this case. A few things stand out:

  • The variance across all tests (except the c7i) is quite large. You list std deviation as ~60 seconds for c8g, and 40 seconds for Apple M1, however the M1 has 4 results above 31 minutes, which I would expect would make std-deviation even larger than c8g. Is that calculation correct? In any case, a predictable, CPU-dominated benchmark such as summing sines shouldn't show nearly that much variation, is there a dynamic solver involved? Do you have any ideas what causes execution time to vary from run to run? A high-level performance monitoring tool such as APerf might help identify factors you didn't expect.
  • Since you are interested in core count, I assume your benchmark is multi-threaded. Have you tested the thread scalability? Is 8 threads nearly 8 times faster than 1 thread, or are there other bottlenecks?
  • Do you know if the simulation is slow to generate the sines, or sum them? Is the code simple enough to share? You mention a 10 second period and 100,000 sinusoids: I assume then that you have a 10kHz sample rate? Are you taking an FFT (FFTW? ArmPL?), or is the data already in the frequency domain?

For the sake of argument, let me look at the theoretical FLOPS performance for M1 and Graviton4 (Neoverse-v2):

  • Assuming you have 100k time-domain sinusoids at 10kHz and you want to sum all of them for a 10 second period, then you will need to perform about 10 GFLOP of work. The Neoverse-v2 SWOG states scalar FADD should have a throughput of 4 operations per cycle, so a non-vectorized implementation should be able to achieve 2.8 * 4 = 11.2 GFLOP/s (clock speed * instruction throughput), assuming the data is effectively pre-loaded in cache. However, the data you wish to operate on is 100k * 10kHz * 10s * 4 bytes/sample = 40 GB, so your code might be memory bandwidth bound or is generating these on the fly. Even with loading the data, the summation should take only on the order of seconds, so I am unclear what takes >20 minutes. This may further motivate you to look at APerf.
  • I was not able to find a definitive SWOG for M1, but online sources suggest it has the same 4-instruction FADD throughput as Graviton4, so the scalar FLOPS performance should be no more than 12% faster on M1 (based on clock rate alone), but in reality we rarely see FLOPS limiting performance, and more typically caches and other operations become important. If the compiler vectorizes the code to use NEON, then both platforms would improve by the vector width. Both have 4x128-bit NEON, so FADD could improve from 4 to 16 FLOPs per clock cycle.
AWS
answered a month ago
EXPERT
reviewed a month ago
0

Dear Leo,

I'm not sure why, but my images did not upload properly. I am trying to re-post them. Please excuse the added post!

smlogan Table 2 Figure 2

answered a month ago
0

Dear Leo,

just to be sure, the very useful in-depth analysis of your measurement results and their dispersion was written by lrbison, not by me.

I apologize - both Leo and lrbison! Thank you lrbison for all your insights too!

Just adding to the general discussion, given your setup and the large variance in run times that lrbison pointed out, were you seeing CPU utilisation consistently hovering around 62.5% (for 5 cores out of 8 running at 100%) in the CloudWatch metrics for the instance while your test was running?

I assume the point of running only 5 processes/threads on several 8-core machines was to measure the relative performance of the processors.

Yes. The CloudWatch plot of CPU usage for the c7i.4xlarge EC2 instance is shown as Figure 3.

I did not have detailed monitoring active.

smlogan

Figure 3

answered a month ago
0

Dear smlogan,

Thank you for your responses so far. I wrote up a quick test code in python:

#!/usr/bin/env python3

import numpy as np
import time

dtype_default = np.float64

def make_a_sine(rng:np.random.Generator, nsamples):
    fs = 1e12
    freq = rng.uniform(0,fs/2)
    ampl = rng.standard_normal()
    phase = rng.uniform(-np.pi, np.pi)
    t = np.arange(nsamples,dtype=dtype_default)*(1/fs)
    return np.sin(t*(2*np.pi*freq) + phase) * ampl

def benchmark():
    rng = np.random.default_rng()
    nsamples_per_sin = 2**20
    cumsum = np.zeros( (nsamples_per_sin,), dtype=dtype_default)
    t_bench = 5
    nsines = 0
    nsamples_proc = 0
    t0 = time.time()
    time_gen = 0
    while time.time() - t0 < t_bench:
        t0_gen = time.time()
        dat = make_a_sine(rng, nsamples_per_sin)
        time_gen += time.time() - t0_gen
        cumsum += dat
        nsamples_proc += len(dat)
        nsines += 1
    t_tot = time.time() - t0
    print(f"Summed {nsines} sinusoids in {t_tot:4.1f} seconds for a total of {nsamples_proc/t_tot/1000/1000:8.3f} MSamples per second")
    print(f"Estimate 100,000 sinusoids in {100000*(t_tot/nsines)/60:6.2f} minutes")
    print(f"{100*time_gen/t_tot:.1f}% of time spent generating data, not summing it.")

benchmark()

The test suggests this single-threaded, simple numpy code can perform the 100,000 sinusoids in about 23 minutes on Graviton3. It also suggests nearly all of that time is spent calling sin():

Summed 357 sinusoids in  5.0 seconds for a total of   74.761 MSamples per second
Estimate 100,000 sinusoids in  23.38 minutes
97.0% of time spent generating data, not summing it.

My recommendation is (1) think about if your actual workload needs to generate sin waves, or if this is just a benchmarking artifact, (2) test your C code using ArmPL. This should provide an improved sin function. For example: gcc sample.c -lamath -lm. You can use ArmPL from either gcc or the Arm compiler (armclang).

See:

AWS
answered a month ago
EXPERT
reviewed a month ago
0

Hi smlogan, just to be sure, the very useful in-depth analysis of your measurement results and their dispersion was written by lrbison, not by me.

Just adding to the general discussion, given your setup and the large variance in run times that lrbison pointed out, were you seeing CPU utilisation consistently hovering around 62.5% (for 5 cores out of 8 running at 100%) in the CloudWatch metrics for the instance while your test was running? I assume the point of running only 5 processes/threads on several 8-core machines was to measure the relative performance of the processors.

If you had detailed monitoring enabled for the EC2 instances, the measurements at 1-minute intervals are stored by CloudWatch for two weeks, so you should be able to inspect them even after terminating the test machines. Without detailed monitoring, the measurements are aggregated at a 5-minute granularity, but you can look at the minimum, maximum, and average values for each interval; the minimum and average values should give a good idea of whether the test was unable to use the full compute capacity due to factors external to the CPU, such as the memory arrays or buses not being able to feed data fast enough to the processors.

EXPERT
answered a month ago
