Ali Sherief

An overview of CUDA

CUDA is NVIDIA's programming language for controlling its graphics cards programmatically, for computations other than rendering to the screen. CUDA code is first compiled to an intermediate "assembly language" that NVIDIA calls PTX, and each PTX version targets a specific generation of NVIDIA GPUs. The native machine code generated from PTX is not forward-compatible: a binary built only for an older GPU generation will not (in most cases) run on a newer generation GPU unless it also embeds the PTX, which the driver can recompile for the newer hardware.

However, CUDA code itself is forward-compatible. You can write CUDA in such a way that it works on many generations of GPUs. That's possible because CUDA is a dialect of C/C++, not assembly.

CUDA Toolkit

CUDA Toolkit is a set of programs and libraries written by NVIDIA that facilitate CUDA development. It includes the compiler (nvcc), a debugger (Nsight on Windows, cuda-gdb on Linux), and GPU drivers, in addition to libraries, headers, and other tools.

The latest version at the time of this writing is 11.2. Newer major versions of the toolkit support newer GPU generations. For example, 10.0 added support for the RTX 20 series, and 11.0 added support for the RTX 30 series.

You can't run a CUDA program compiled for one GPU generation on a system with a different GPU generation if you used the default settings, because by default the toolkit builds the program for that specific GPU only. However, with some extra compiler flags it is possible to produce a binary that works on multiple GPU generations.
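As a sketch of what those flags look like (the compute capability values here are illustrative; check which ones your GPUs actually support), nvcc's `-gencode` option can embed code for several generations into one "fat binary":

```sh
# Embed machine code for compute capability 7.5 (RTX 20 series) and
# 8.6 (RTX 30 series), plus PTX for 8.6 so that even newer GPUs
# can JIT-compile it at load time.
nvcc -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_86,code=sm_86 \
     -gencode arch=compute_86,code=compute_86 \
     -o my_program my_program.cu
```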

You should, however, be able to run a binary created with a different version of the toolkit than the one installed on the target system, provided both systems have the same GPU generation.

How does a GPU run programs?

When a CPU runs a program that has been compiled into a binary, it executes the series of instructions on a single core. Most modern CPUs have hyperthreading, which puts two hardware threads on each core, so the program may actually run on one of those threads. The core also has a handful of registers where the program keeps the variables it is working on right now; a stack that holds more variables, along with the chain of function calls that led to the current statement; and L1, L2, and L3 caches that hold data too large to fit in the registers.

None of this is applicable to GPUs, though. They have a different way of executing functions (functions, not programs: CUDA doesn't support a main() function of its own, so a CUDA computation has to be launched by C/C++ code running on the CPU).

In a GPU, you don't have cores; you have blocks of several hundred threads executing a bunch of statements in parallel. Whereas you'd have to explicitly add parallelism to CPU programs to make them use all the cores, all CUDA programs are naturally parallel, with no user additions needed to make that work.
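To make that concrete, here is a minimal sketch of a CUDA kernel (the kernel name and sizes are made up for illustration). Every thread executes the same function body in parallel; the only thing that distinguishes one thread from another is its index:

```cuda
#include <cuda_runtime.h>

// A kernel: a function that runs on the GPU, once per thread.
__global__ void add_one(float *data, int n) {
    // Each thread computes its own global index and handles one element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1024;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launch 4 blocks of 256 threads each: 1024 threads run in parallel,
    // with no explicit threading code written by the user.
    add_one<<<4, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```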

Let's talk more about blocks. Every block has an x, y, z index that uniquely identifies it. Blocks can be organized all in one line, or in a rectangle, or in a 3D box. At any rate, a group of blocks is called a grid. The maximum grid size is 2^31-1 by 65535 by 65535 blocks. Of course you can't run that many threads at once; you are limited to executing as many blocks simultaneously as your GPU has Streaming Multiprocessors (see below).
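Grid and block shapes are expressed with the `dim3` type at launch time. A quick sketch (the kernel name and sizes are hypothetical):

```cuda
// A 2D grid of 16 x 8 blocks, each block holding 32 x 8 = 256 threads.
dim3 grid(16, 8);    // inside the kernel, blockIdx.x is in [0,16), blockIdx.y in [0,8)
dim3 block(32, 8);   // threadIdx.x is in [0,32), threadIdx.y in [0,8)

my_2d_kernel<<<grid, block>>>(d_data);  // my_2d_kernel is a hypothetical kernel
```

Unspecified dimensions default to 1, which is why the earlier one-dimensional launch could pass plain integers.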

Within each block, there are several threads, each with an x, y, z index within the block. Threads in the same block can share data with each other through fast on-chip shared memory and can synchronize with each other; threads in different blocks, by contrast, have no cheap way to exchange data during a kernel launch. Threads also have a maximum dimension size, though it's much smaller than a block's: in modern GPUs it's 1024x1024x64 per block, with up to 1024 threads total in a block. Requesting more than 1024 threads per block will make the CUDA kernel fail to launch.
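Here is an illustrative sketch of threads in one block cooperating through shared memory (the kernel name is made up): each block reverses its own 256-element chunk of an array, and `__syncthreads()` makes every thread wait until the whole block has staged its data.

```cuda
#include <cuda_runtime.h>

// Each block reverses its 256-element chunk using on-chip shared memory.
__global__ void reverse_chunks(int *data) {
    __shared__ int tile[256];       // visible to all threads in this block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i];    // every thread stages one element
    __syncthreads();                // wait until the whole block has written

    // Read an element that a *different* thread in the block wrote.
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Launched as, e.g., reverse_chunks<<<numBlocks, 256>>>(d_data);
```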

Blocks and threads aren't all that's needed to execute a CUDA program, though. A GPU also has Streaming Multiprocessors, or SMs, each of which executes blocks and all of their threads. An SM has a queue of blocks that are scheduled to run on it, and blocks wait in that queue until the SM has room for them. The number of SMs in a GPU increases as NVIDIA makes engineering improvements in newer generations.
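You can query these hardware characteristics at runtime. As a sketch, the CUDA runtime API reports the SM count and the per-block limits discussed above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of GPU number 0

    printf("SMs:               %d\n", prop.multiProcessorCount);
    printf("Warp size:         %d\n", prop.warpSize);
    printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims:    %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}
```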

So a block (and by extension, a thread) is not analogous to a CPU core; a block is more of a unit of work that is run on the GPU. The SM, on the other hand, does resemble a CPU core in some ways.

Inside an SM, when blocks are being executed, threads are organized into groups of 32. Each group of 32 threads is called a warp. The SM can execute a block much faster when it can run exactly 32 threads at the same time, which implies that developers should put a multiple of 32 threads in each block.

Each GPU has a specific limit on how many warps can execute at once. This means that making the thread dimensions ever larger does not make the program any faster.
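A common pattern that follows from this (a sketch; the problem size and kernel name are illustrative) is to pick a block size that is a multiple of 32 and round the grid size up so every element is covered:

```cuda
const int n = 100000;             // number of elements to process
const int threadsPerBlock = 256;  // a multiple of the warp size (32)

// Round up so every element gets a thread: ceil(n / threadsPerBlock).
const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

// my_kernel is hypothetical; threads in the last, partially-used block
// are expected to guard with something like `if (i < n)`.
my_kernel<<<blocks, threadsPerBlock>>>(d_data, n);
```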


I intend to write more about CUDA in the coming weeks. To me it is a very fascinating topic, and it gives you the power to parallelize many algorithms at a scale that was previously not possible on CPUs.
