How Do Graphics Cards Execute Vector Instructions?

reg__ profile image Adam Sawicki Originally published at asawicki.info on ・5 min read

Intel announced that together with their new graphics architecture they will provide a new API, called oneAPI, that will allow to program GPU, CPU, and even FPGA in an unified way, and will support SIMD as well as SIMT mode. If you are not sure what does it mean but you want to be prepared for it, read this article. Here I try to explain concepts like SIMD, SIMT, AoS, SoA, and the vector instruction execution on CPU and GPU. I think it may interest to you as a programmer even if you don't write shaders or GPU computations. Also, don't worry if you don't know any assembly language - the examples below are simple and may be understandable to you, anyway. Below I will show three examples:

1. CPU, scalar

Let's say we write a program that operates on a numerical value. The value comes from somewhere and before we pass it for further processing, we want to execute following logic: if it's negative (less than zero), increase it by 1. In C++ it may look like this:

float number = ...;
bool needsIncrease = number < 0.0f;
  number += 1.0f;

If you compile this code in Visual Studio 2019 for 64-bit x86 architecture, you may get following assembly (with comments after semicolon added by me):

00007FF6474C1086 movss  xmm1,dword ptr [number]   ; xmm1 = number
00007FF6474C108C xorps  xmm0,xmm0                 ; xmm0 = 0
00007FF6474C108F comiss xmm0,xmm1                 ; compare xmm0 with xmm1, set flags
00007FF6474C1092 jbe    main+32h (07FF6474C10A2h) ; jump to 07FF6474C10A2 depending on flags
00007FF6474C1094 addss  xmm1,dword ptr [__real@3f800000 (07FF6474C2244h)]  ; xmm1 += 1
00007FF6474C109C movss  dword ptr [number],xmm1   ; number = xmm1
00007FF6474C10A2 ...

There is nothing special here, just normal CPU code. Each instruction operates on a single value.

2. CPU, vector

Some time ago vector instructions were introduced to CPUs. They allow to operate on many values at a time, not just a single one. For example, the CPU vector extension called Streaming SIMD Extensions (SSE) is accessible in Visual C++ using data types like __m128 (which can store 128-bit value representing e.g. 4x 32-bit floating-point numbers) and intrinsic functions like _mm_add_ps (which can add two such variables per-component, outputting a new vector of 4 floats as a result). We call this approach Single Instruction Multiple Data (SIMD), because one instruction operates not on a single numerical value, but on a whole vector of such values in parallel.

Let's say we want to implement following logic: given some vector (x, y, z, w) of 4x 32-bit floating point numbers, if its first component (x) is less than zero, increase the whole vector per-component by (1, 2, 3, 4). In Visual C++ we can implement it like this:

const float constant[] = {1.0f, 2.0f, 3.0f, 4.0f};
__m128 number = ...;
float x; _mm_store_ss(&x, number);
bool needsIncrease = x < 0.0f;
  number = _mm_add_ps(number, _mm_loadu_ps(constant));

Which gives following assembly:

00007FF7318C10CA  comiss xmm0,xmm1  ; compare xmm0 with xmm1, set flags
00007FF7318C10CD  jbe    main+69h (07FF7318C10D9h)  ; jump to 07FF7318C10D9 depending on flags
00007FF7318C10CF  movaps xmm5,xmmword ptr [__xmm@(...) (07FF7318C2250h)]  ; xmm5 = (1, 2, 3, 4)
00007FF7318C10D6  addps  xmm5,xmm1  ; xmm5 = xmm5 + xmm1
00007FF7318C10D9  movaps xmm0,xmm5  ; xmm0 = xmm5

This time xmm registers are used to store not just single numbers, but vectors of 4 floats. A single instruction - addps (as opposed to addss used in the previous example) adds 4 numbers from xmm1 to 4 numbers in xmm5.

It may seem obvious, but it's important for future considerations to note that the condition here and the boolean variable driving it (needsIncrease) is not a vector, but a single value, calculated based on the first component of vector number. Such a single value in the SIMD world is also called a "scalar". Based on it, the condition is true or false and the branch is taken or not, so either the whole vector is increased by (1, 2, 3, 4), or nothing happens. This is how CPUs work, because we execute just one program, with one thread, which has one instruction pointer to execute its instructions sequentially.

3. GPU

Now let's move on from CPU world to the world of a graphic processor (GPU). Those are programmed in different languages. One of them is GLSL, used in OpenGL and Vulkan graphics APIs. In this language there is also a data type that holds 4x 32-bit floating-point numbers, called vec4. You can add a vector to a vector per-component using just '+' operator.

Same logic as in section 2. implemented in GLSL looks like this:

vec4 number = ...;
bool needsIncrease = number.x < 0.0;
  number += vec4(1.0, 2.0, 3.0, 4.0);

When you compile a shader with such code for an AMD GPU, you may see following GPU assembly: (For offline shader compilation I used Radeon GPU Analyzer (RGA) - free tool from AMD.)

v_add_f32      v5, 1.0, v2      ; v5 = v2 + 1
v_add_f32      v1, 2.0, v3      ; v1 = v3 + 2
v_cmp_gt_f32   vcc, 0, v2       ; compare v2 with 0, set flags
v_cndmask_b32  v2, v2, v5, vcc  ; override v2 with v5 depending on flags
v_add_f32      v5, lit(0x40400000), v4  ; v5 = v4 + 3
v_cndmask_b32  v1, v3, v1, vcc  ; override v1 with v3 depending on flags
v_add_f32      v3, 4.0, v0      ; v3 = v0 + 4
v_cndmask_b32  v4, v4, v5, vcc  ; override v4 with v5 depending on flags
v_cndmask_b32  v3, v0, v3, vcc  ; override v3 with v0 depending on flags

You can see something interesting here: Despite high level shader language is vector, the actual GPU assembly operates on individual vector components (x, y, z, w) using separate instructions and stores their values in separate registers like (v2, v3, v4, v0). Does it mean GPUs don't support vector instructions?!

Actually, they do, but differently. First GPUs from decades ago (right after they became programmable with shaders) really operated on those vectors in the way we see them. Nowadays, it's true that what we treat as vector components (x, y, z, w) or color components (R, G, B, A) in the shaders we write, becomes separate values. But GPU instructions are still vector, as denoted by their prefix "v_". The SIMD in GPUs is used to process not a single vertex or pixel, but many of them (e.g. 64) at once. It means that a single register like v2 stores 64x 32-bit numbers and a single instruction like v_add_f32 adds per-component 64 of such numbers - just Xs or Ys or Zs or Ws, one for each pixel calculated in a separate SIMD lane.

Some people call it Structure of Arrays (SoA) as opposed to Array of Structures (AoS). This term comes from an imagination of how the data structure as stored in memory could be defined. If we were to define such data structure in C, the way we see it when programming in GLSL is array of structures:

struct {
  float x, y, z, w;
} number[64];

While the way the GPU actually operates is kind of a transpose of this - a structure of arrays:

struct {
  float x[64], y[64], z[64], w[64];
} number;

It comes with an interesting implication if you consider the condition we do before the addition. Please note that we write our shader as if we calculated just a single vertex or pixel, without even having to know that 64 of them will execute together in a vector manner. It means we have 64 Xs, Ys, Zs, and Ws. The X component of each pixel can be less or not less than 0, meaning that for each SIMD lane the condition may be fulfilled or not. So boolean variable needsIncrease inside the GPU is not a scalar, but also a vector, having 64 individual boolean values - one for each pixel! Each pixel may want to enter the if clause or skip it. That's what we call Single Instruction Multiple Threads (SIMT), and that's how real modern GPUs operate. How is it implemented if some threads want to do if and others want to do else? That's a different story...

Posted on by:

reg__ profile

Adam Sawicki


All opinions are my own and do not reflect that of my employer.


Editor guide