SIMD in Depth - Performance and Cost in C# and C++

7 April 2016

This is a follow-on from a previous post explaining what SIMD is, how .NET developers can use it, and what performance gains they can expect to see. While the previous post showed how to take advantage of SIMD, this post gives a deeper understanding of what happens when you use it in both C# and C++ by looking at the assembly code that's actually executed. This should help you understand why you get the performance gains you do, why you may not get as much performance as you'd expect, and how to debug any problems. It also highlights the different costs and benefits of using SIMD in these languages and illustrates a few reasons why C++ should always have the upper hand in producing maximally optimised executable code.

As with the previous post, all code for this one is available on GitHub.

The disassembly for this post was generated using Microsoft Visual Studio Community 2015 with Update 1, C# 6.0, RyuJIT 4.6.1055.0 and Visual C++ 2015.

Viewing Assembly/Disassembly

Using Visual Studio it's possible to view a disassembly of the machine code instructions that are being executed. This is a close reflection of the assembly that the compiler (C++ or .NET JIT) produced and so it helps with understanding the performance of your application. Before viewing the disassembly you need to configure a few settings for C# projects (for full details, see here).

  • Firstly, you need the C# compiler to produce optimised CIL, so select the Release configuration.

  • Release builds don't generate a .pdb file by default, but you'll need one so the debugger can correlate locations in the disassembly back to lines in your source code. In the project properties select the Build tab, click the Advanced... button at the bottom, and in the dialog that opens set Debug Info to pdb-only (the equivalent MSBuild properties are shown after this list).

  • You'll also want to ensure the JIT compiler produces optimised native code, so from Tools -> Options -> Debugging -> General ensure that Suppress JIT optimization on module load is unchecked.

  • Finally, optimised code is not considered 'My Code', so to prevent the debugger from stepping over it, go again to Tools -> Options -> Debugging -> General and ensure that Enable Just My Code is also unchecked.

Note that the last two settings are global and therefore apply to all solutions.
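
If you prefer editing the project file to clicking through dialogs, the first two settings map to standard MSBuild properties in the Release PropertyGroup (a sketch, not from the original post; your .csproj will have other properties alongside these):

<PropertyGroup Condition=" '$(Configuration)' == 'Release' ">
  <!-- Optimised CIL, plus a pdb so the debugger can map disassembly to source -->
  <Optimize>true</Optimize>
  <DebugType>pdbonly</DebugType>
</PropertyGroup>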

Now when you run your application with the debugger attached you can break on any line and see the corresponding disassembly by going to Debug -> Windows -> Disassembly. You can step through the disassembly instructions just as you would with C# code.

C# is compiled to machine code at run time, while for C++ this happens at compile time, but you can view the disassembly in the same way: break on any line and go to Debug -> Windows -> Disassembly. The debugger can seem to behave strangely for fully optimised code, for example skipping over dead code, but understanding this behaviour actually gives an insight into how your code has been optimised. More on this below.

C# Disassembly

Simple addition

The simplest case I'll look at is the addition of two floats without using SIMD.

float result = x + y; 

This compiles to the following assembly:

movss       xmm0,dword ptr [7FE95BF2EC8h]  ; load x into xmm0 
addss       xmm0,dword ptr [7FE95BF2ECCh]  ; add y to xmm0

Here, as annotated, you can see one assembly instruction to load x into the CPU register xmm0, and one to add y to the contents of xmm0, storing the result in xmm0.

The machine that produced the above implements the AVX instruction set extension, which means it's capable of performing 256-bit SIMD arithmetic for floating-point types but only 128-bit SIMD arithmetic for integer types. Notice that it uses the 128-bit, SIMD-capable register xmm0 even for this non-SIMD addition. This is often the case. If you've got an AVX2-capable processor you may see it using 256-bit registers, whose names begin with ymm.

SIMD addition

Next, here is the disassembly for SIMD addition of four floats.

var xs = new[] { 1.0f, 2.0f, 3.0f, 4.0f };
var ys = new[] { 0.1f, 0.2f, 0.3f, 0.4f };
var result = new Vector<float>(xs) + new Vector<float>(ys);

Omitting the assembly that loads the values into the array memory, here are the instructions that correspond to the Vector<float> creation and addition:

movups      xmm0,xmmword ptr [rsi+10h]  ; load xs into xmm0
movups      xmm1,xmmword ptr [rax+10h]  ; load ys into xmm1
addps       xmm0,xmm1                   ; add xmm1 to xmm0

This time there's one assembly instruction to load xs into xmm0, one to load ys into xmm1, and one to perform the addition. The calculation happens in the same registers as before and adding four floats in this way is clearly more efficient than repeating the previous example four times.

The above disassembly was also generated on a machine with an AVX processor, yet I only added four pairs of floats even though the hardware can process eight at a time. AVX only supports 256-bit operations for floating-point types (integer operations remain 128-bit until AVX2), and RyuJIT is more conservative still, limiting even float arithmetic to 128 bits on AVX-only processors (RyuJIT does use 256-bit operations for all numeric types on AVX2 processors). Future versions of RyuJIT may remove this limitation, which is an example of how SIMD algorithms written now could become more performant in the future without being rewritten, recompiled or redeployed.
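
You can check what RyuJIT is doing on a given machine from C# itself: Vector<T>.Count reports how many elements fit in one hardware vector, so it reflects the register width in use. Here's a quick sketch (my illustration, not code from the post):

using System;
using System.Numerics;

class VectorWidthInfo
{
    static void Main()
    {
        // False means Vector<T> falls back to element-by-element software maths.
        Console.WriteLine("Hardware accelerated: " + Vector.IsHardwareAccelerated);

        // 4 floats or ints per vector => 128-bit registers; 8 => 256-bit.
        Console.WriteLine("Vector<float>.Count: " + Vector<float>.Count);
        Console.WriteLine("Vector<int>.Count:   " + Vector<int>.Count);
    }
}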

SIMD array addition

The gains are tiny in isolated calculations like these, but things get more interesting if we perform addition in a loop.

int[] x = GenerateRandomIntArray(1000);
int[] y = GenerateRandomIntArray(1000);
var result = new int[1000];
for (var i = 0; i < 1000; i += Vector<int>.Count) {
    var vx = new Vector<int>(x, i);
    var vy = new Vector<int>(y, i);
    (vx + vy).CopyTo(result, i);
}

is JIT compiled for an AVX processor to:

; array initialisation omitted...

xor         edx,edx                          ; initialise edx (loop counter i) to zero

; LOOP_START
mov         ecx,dword ptr [rsi+8]            ; load x.Length into ecx
cmp         edx,ecx                          ; if i >= x.Length
jae         000007FE95B958E7                 ; throw IndexOutOfRangeException
lea         r8d,[rdx+3]                      ; load i+3 into r8d
cmp         r8d,ecx                          ; if i+3 >= x.Length
jae         000007FE95B958E7                 ; throw IndexOutOfRangeException
movups      xmm0,xmmword ptr [rsi+rdx*4+10h] ; load x[i..i+3] (vx) into xmm0

mov         ecx,dword ptr [rdi+8]            ; load y.Length into ecx
cmp         edx,ecx                          ; if i >= y.Length
jae         000007FE95B958E7                 ; throw IndexOutOfRangeException
cmp         r8d,ecx                          ; if i+3 >= y.Length
jae         000007FE95B958E7                 ; throw IndexOutOfRangeException
movups      xmm1,xmmword ptr [rdi+rdx*4+10h] ; load y[i..i+3] (vy) into xmm1

paddd       xmm0,xmm1                        ; perform SIMD addition of xmm0 and xmm1
mov         ecx,dword ptr [rax+8]            ; load result.Length into ecx
cmp         edx,ecx                          ; if i >= result.Length
jae         000007FE95B958EC                 ; throw ArgumentException
cmp         r8d,ecx                          ; if i+3 >= result.Length
jae         000007FE95B958F1                 ; throw ArgumentException

movups      xmmword ptr [rax+rdx*4+10h],xmm0 ; move the result out of xmm0 into the result array

add         edx,4                            ; increment loop counter, i, by 4
cmp         edx,3E8h                         ; if i < 1000 (0x3E8)
jl          000007FE95B9589A                 ; go back to LOOP_START

As shown in the previous post, the above algorithm executes about twice as fast as a simple loop that doesn't use SIMD (on a processor with AVX). Looking at the above assembly you can see one reason why it's not a clean 4x improvement. The .NET CLR performs array bounds checks so that indexing errors throw an IndexOutOfRangeException. This happens regardless of whether you use SIMD, but for the SIMD code above some of the checks are superfluous. Before the four values x[i..i+3] are copied into xmm0 there is a check that both i and i+3 are within the upper bound of x. However, if i+3 is within this bound then i must be too, so checking it is wasteful. There are similar superfluous checks before the second operand, y[i..i+3], is copied into xmm1, and before the result of the calculation is copied out of xmm0 into the result array. RyuJIT should improve here in a future release, which again shows how the performance of SIMD algorithms compiled today could improve in the future without being rebuilt or redeployed.

Even if you slow down the simple non-SIMD loop by adding extra bounds checks (and ensuring they're not optimised out) so that it's equivalent to the SIMD loop, you'll probably still not see a 4x speed-up. This is because the arithmetic is likely no longer the bottleneck in the SIMD algorithm. The values in the array must be moved from memory (depending on the size of the array this could be from RAM, or even further away) to the processor's SIMD registers and back again, and chances are that something in this pipeline (e.g. memory bandwidth, memory latency, cache speed) has become the bottleneck in the loop's execution, so increasing the arithmetic throughput beyond this point yields no further improvement. However, it's another case where the relative performance of a SIMD algorithm can improve in the future without being rewritten. As future processors ease these bottlenecks, the advantage of using SIMD will increase.
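
Note also that the C# loop above relies on the array length (1000) being an exact multiple of Vector<int>.Count. General-purpose code needs a scalar loop to handle any remaining elements, much like the remainder loop the C++ compiler generates later in this post. Here's a minimal sketch (the AddArrays method is my illustration, not code from the post):

using System.Numerics;

static int[] AddArrays(int[] x, int[] y)
{
    var result = new int[x.Length];
    var lastBlockIndex = x.Length - (x.Length % Vector<int>.Count);

    // Vectorised loop: Vector<int>.Count elements per iteration.
    var i = 0;
    for (; i < lastBlockIndex; i += Vector<int>.Count)
    {
        (new Vector<int>(x, i) + new Vector<int>(y, i)).CopyTo(result, i);
    }

    // Scalar remainder loop: the final few elements one at a time.
    for (; i < x.Length; i++)
    {
        result[i] = x[i] + y[i];
    }

    return result;
}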

Using SIMD in C++

With C++ you can take full control of the SIMD instructions that are executed on the CPU, via compiler intrinsics rather than inline assembly, or alternatively you can tell the compiler which processor architecture to build for and let it work out when and how to use SIMD. While the .NET JIT compiler ensures that only instructions the target processor implements are executed on it, C++ developers get no such help and must guarantee this themselves. So while you benefit from more performant executables, you pay for it with more complex code and/or build and deployment.

Let's look at the disassembly of some C++ code. The following adds two arrays of ints without doing anything specific to invoke SIMD intrinsics.

const auto testSetSize = 50;
int* x = CreateRandomIntArray(testSetSize);
int* y = CreateRandomIntArray(testSetSize);
auto result = new int[testSetSize];
for (auto i = 0; i < testSetSize; i++) {
    result[i] = x[i] + y[i];
}

This was compiled using Visual C++ 2015 with Full Optimisation (/Ox), Enable Intrinsic Functions enabled (/Oi) and AVX2 processor architecture (/arch:AVX2) to the following:

; array initialisation and loop setup omitted...

; SIMD_LOOP_START
vmovdqu     ymm1,ymmword ptr [rax-20h]           ; load 8 ints (256 bits) from x into 256-bit register ymm1
vpaddd      ymm1,ymm1,ymmword ptr [rcx+rax-20h]  ; add 8 ints from y to those in ymm1 and store result back in ymm1
vmovdqu     ymmword ptr [r8+rax-20h],ymm1        ; move result out of ymm1 into the result array
vmovdqu     ymm2,ymmword ptr [rax]               ; load the next 8 ints from x into ymm2
vpaddd      ymm1,ymm2,ymmword ptr [rcx+rax]      ; add the next 8 ints from y to those in ymm2 and store the result in ymm1
vmovdqu     ymmword ptr [r8+rax],ymm1            ; move the result out of ymm1 into the result array
lea         rax,[rax+40h]                        ; increment the array indexer by 16 ints (64 bytes)
sub         r9,1                                 ; decrement the loop counter
jne         main+120h                            ; if loop counter != 0 go back to SIMD_LOOP_START

; SIMPLE_LOOP_START
mov         ecx,dword ptr [rbx+rax]              ; load one int from x into ecx
add         ecx,dword ptr [rax]                  ; add one int from y to the value in ecx and store the result in ecx
mov         dword ptr [rdx+rax],ecx              ; move the result out of ecx into the result array
lea         rax,[rax+4]                          ; increment the array indexer by one int (4 bytes)
sub         rdi,1                                ; decrement the loop counter
jne         main+160h                            ; if loop counter != 0 go back to SIMPLE_LOOP_START

Notice that the compiler, as enabled by the optimisation and /arch:AVX2 options, has seen the potential to optimise the algorithm using SIMD and has done so by auto-vectorising it. Just as with the custom array addition algorithm from the previous post, there are two loops in this assembly: one that performs SIMD addition in chunks of eight ints at a time, and a second that sums the remaining elements at the end of the array.

The above assembly will run fine on an AVX2 architecture, but it will cause run-time errors on older processors. If you need to deploy to multiple architectures then one approach is to isolate SIMD capable algorithms into a separate .dll, compile different versions of that .dll for each target architecture, and ensure the appropriate .dll is deployed at installation.

Alternatively, you can move some complexity from the deployment/installation process into the source code by deploying the .dlls for all processor architectures to all machines, detecting the current processor's architecture at run time (see MSVC++ specific details), and loading the correct .dll as appropriate.
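
As a sketch of what that run-time check might look like with MSVC's intrinsics, here's one possible implementation of the AvxAvailable() helper assumed by the example below (my illustration; the CPUID and XCR0 bit positions are standard). It checks both the CPU's AVX flag and that the OS saves the 256-bit YMM state on context switches:

#include <intrin.h>

bool AvxAvailable()
{
    int info[4];
    __cpuid(info, 1);                                 // leaf 1: feature flags in ECX/EDX
    const auto osxsave = (info[2] & (1 << 27)) != 0;  // OS has enabled XGETBV
    const auto avx     = (info[2] & (1 << 28)) != 0;  // CPU supports AVX
    if (!osxsave || !avx)
        return false;

    const auto xcr0 = _xgetbv(0);                     // read XCR0 (XFEATURE_ENABLED_MASK)
    return (xcr0 & 0x6) == 0x6;                       // XMM (bit 1) and YMM (bit 2) state saved
}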

Finally, you can move all the complexity out of build and deployment into the source code by keeping all the SIMD algorithms in the same binary. This makes for a simpler build, but you lose the power of the C++ compiler: you must not only perform run-time checks to determine the current processor's architecture, you must also take low-level control of the processor instructions used in your SIMD algorithms. Here is an example of array addition that checks for AVX at run time and uses 128-bit SIMD intrinsics when it's available, falling back to a simple scalar loop otherwise.

const auto testSetSize = 1000;
int* x = CreateRandomIntArray(testSetSize);
int* y = CreateRandomIntArray(testSetSize);
auto result = new int[testSetSize];

if (AvxAvailable()) {
    const auto intsPerSIMD = 4;
    const auto bytesPerSIMD = 16;
    __m128i vx, vy;
    for (auto i = 0; i < testSetSize; i += intsPerSIMD) {
        memcpy(vx.m128i_i32, x + i, bytesPerSIMD);
        memcpy(vy.m128i_i32, y + i, bytesPerSIMD);
        memcpy(result + i, _mm_add_epi32(vx, vy).m128i_i32, bytesPerSIMD);
    }
}
else {
    for (auto i = 0; i < testSetSize; i++) {
        result[i] = x[i] + y[i];
    }
}
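
Incidentally, the memcpy round trips through the __m128i fields aren't required. A slightly more idiomatic version of the SIMD branch (my variant, assuming the same x, y, result and testSetSize, with <emmintrin.h> included for the SSE2 intrinsics) uses the unaligned load/store intrinsics directly:

// Idiomatic SIMD branch: unaligned loads/stores instead of memcpy.
// Assumes testSetSize is a multiple of 4 ints.
for (auto i = 0; i < testSetSize; i += 4) {
    const auto vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
    const auto vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(result + i), _mm_add_epi32(vx, vy));
}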

Comparison of C# and C++

The C++ compiler was able to improve the performance of the compiled machine code using auto-vectorisation and, looking at the compiled assembly from the first C++ example above, you can see two more optimisations that make the machine code faster than that produced by the .NET JIT compiler for a similar algorithm.

  1. There are no bounds checks on the array access. C++ allows you to read (and write) beyond the end of the array you're indexing into. Doing so would almost certainly be a mistake (or something malicious) but the lack of safeguards makes for faster executables.
  2. The SIMD loop has been unrolled, so that each pass through the loop executes two SIMD additions. The loop therefore runs half as many times, which speeds up execution. It also makes the debugger act a little strangely. If you put a breakpoint on the addition within the for loop it will only be hit five times, not 50. This is because each SIMD addition sums eight int pairs, and this happens twice within the unrolled loop, so after three passes through the loop we've hit the breakpoint three times and summed 48 int pairs. The final two int pairs are summed in the non-SIMD loop, which hits the breakpoint two more times.

While these optimisations could theoretically be applied by the .NET JIT compiler too, it takes a lot of work during the time-sensitive JIT process to prove they are safe. For example, consider what the JIT compiler would have to do to remove the array bounds checking. It would need to:

  • Analyse the loop termination condition and prove that it prevents the array indexer from indexing outside the array.
  • Prove that the array indexer cannot be modified from anywhere else, including another thread, during the loop's execution.
  • Prove that the array itself cannot be moved or resized from elsewhere during the loop's execution.

All of this is often possible for the JIT compiler, but it takes time, and if performed for every loop it would slow down the load times of .NET apps. The trade-off for faster execution is not currently considered worthwhile. The same is true of auto-vectorisation, which means that, currently, .NET developers who wish to take advantage of SIMD must use the SIMD-enabled types in System.Numerics, or perform the vectorisation manually using Vector<T>.
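
That said, the JIT does already eliminate bounds checks in the canonical case where the loop condition compares the index directly against the array's own Length. A minimal sketch of the pattern (my illustration, not from the post):

static long Sum(int[] a)
{
    long sum = 0;

    // Because i is bounded by a.Length, and neither a nor i can change
    // under the loop's feet, the JIT can prove a[i] is always in range
    // and omits the per-element bounds check.
    for (var i = 0; i < a.Length; i++)
    {
        sum += a[i];
    }

    return sum;
}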

Article By

Eoin Mullan

Principal Engineer