In my last post I gave an overview of how GPUs execute code differently from CPUs, and how NVIDIA's toolchain compiles CUDA code into an intermediate assembly language called PTX before assembling it into binaries.
Today I will show you the most important features of CUDA programs: threads, blocks, and host and device code, with examples taken straight from the CUDA Toolkit documentation itself, Stack Overflow, and other places.
Host vs device code
Since a CUDA program runs code on both the CPU and the GPU, there's got to be a way for the developer and the compiler to identify which functions are supposed to run on the CPU and which run on the GPU.
CUDA calls code that is slated to run on the CPU host code, and functions that are bound for the GPU device code. You can tell the two apart by looking at the function signatures: device code has the __global__ or __device__ qualifier at the beginning of the function, while host code needs no such qualifier. It is, after all, ordinary CPU code, and requiring a mandatory qualifier on functions that the rest of the C/C++ world compiles without one wouldn't make much sense. Sometimes you will see the equivalent __host__ qualifier, but it is optional.
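Here's a minimal sketch (the function names are my own, not from any particular codebase) showing how the qualifiers look side by side:

// __global__ marks a kernel: it runs on the GPU but is launched from host code
__global__ void my_kernel(float *data) { /* ... */ }

// __device__ marks a helper that runs on the GPU and can only be called from device code
__device__ float my_helper(float x) { return x * 2.0f; }

// No qualifier (or the optional __host__) means an ordinary CPU function
void my_cpu_function() { /* ... */ }
__host__ void my_other_cpu_function() { /* ... */ }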
When you use __device__ before a variable, you are declaring a variable that lives in what's called the global memory space of the GPU. It can be read and written directly by threads in all blocks, and the host can reach it through special CUDA functions that copy data to and from the device (these are latency-heavy operations though, so they should only be used during times when performance is not critical, such as at the beginning and end of a program).
One copy of each global memory-bound variable exists per GPU.
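As a rough sketch of what using one looks like (the variable name here is made up), a __device__ variable is declared at file scope in a .cu file, and the host copies data in and out of it with cudaMemcpyToSymbol and cudaMemcpyFromSymbol:

// Lives in the GPU's global memory; one copy per device
__device__ float scale_factor;

int main()
{
    float host_value = 2.5f;
    // Host -> device copy into the __device__ variable (latency-heavy, do it up front)
    cudaMemcpyToSymbol(scale_factor, &host_value, sizeof(float));

    // ... launch kernels that read scale_factor ...

    // Device -> host copy back out at the end
    cudaMemcpyFromSymbol(&host_value, scale_factor, sizeof(float));
    return 0;
}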
There is also the __host__ keyword. Equivalent to not using any keyword at all, it defines a plain old C/C++ function. And finally, using __host__ and __device__ together enables a function to be compiled for, and run on, both the CPU and the GPU.
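A quick sketch of that last case (the function is my own example):

// Compiled twice: once for the host, once for the device,
// so both CPU code and kernels can call it.
__host__ __device__ float clamp01(float x)
{
    if (x < 0.0f) return 0.0f;
    if (x > 1.0f) return 1.0f;
    return x;
}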
Generally it is more efficient to separate CPU-bound and GPU-bound code because you will then be able to optimize the GPU code separately.
Want to know more about the difference between using __global__ and __device__? Look no further, because Stack Overflow user FacundoGFlores explained the difference very well in this answer, which I shall quote below:
Differences between __device__ and __global__ functions are:
__device__ functions can be called only from the device, and are executed only on the device.
__global__ functions can be called from the host, and are executed on the device.
Therefore, you call __device__ functions from kernel functions, and you don't have to set the kernel settings. You can also "overload" a function, e.g.: you can declare void foo(void) and __device__ foo(void); then one is executed on the host and can only be called from a host function. The other is executed on the device and can only be called from a __device__ or kernel function.
You can also visit the following link: http://code.google.com/p/stanford-cs193g-sp2010/wiki/TutorialDeviceFunctions, it was useful for me.
If we go to the Google Code link mentioned in the quote, we find a function that uses __device__:
__device__ float my_device_function(float x) { return x + 1; }
Not exactly the most useful CUDA function, but it demonstrates how similar CUDA functions are to normal ones. It just adds 1 to a value, but it does so on the GPU: every thread that calls it performs its own copy of that addition on its own value.
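To actually use it, a __global__ kernel has to call it from device code. Here's a rough sketch of what such a call site could look like; the kernel name and indexing are my own, not taken from the tutorial:

__global__ void add_one_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = my_device_function(data[i]); // each thread handles one element
}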
Here's another example of a global function and a regular function:
// Kernel definition
/*
 * This is a __global__ function that can be called from C land
 */
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    /*
     * This takes N-by-N arrays of floats `A` and `B`
     * and writes the result of `A+B` into `C`.
     */
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
/*
 * As you saw above, we do not iterate through all the
 * values of `0..N`; each thread of the block gets a unique
 * pair of indices and adds only that one element.
 *
 * There is no gather step at the end: `A`, `B` and `C` all
 * live in the GPU's global memory, and every thread writes
 * its own element of `C` directly. The kernel is done once
 * all of its threads have finished running, and at that
 * point the result in `C` can be copied back to (or read
 * from) C/C++ land on the host.
 *
 * The block itself is handed to an SM, which runs its
 * threads; the SM doesn't collect results, it just executes
 * the threads that write them to global memory.
 */
int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    // This is the syntax for calling global functions.
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
The snippet above runs N-by-N copies of the same function. I say N-by-N because threads can have both an x-index and a y-index. This is mainly a convenience for people who work on two-dimensional graphics, so that they don't have to construct the x- and y-coordinates by hand.
It is also useful for working on two-dimensional arrays in general. Thread indices can also take a z-axis, which lets you naturally process 3D arrays and other volumetric data.
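For completeness, here is a tiny sketch (my own, not from the CUDA docs) of a kernel that uses all three thread dimensions to fill a small 3D array living in device memory:

#define DX 4
#define DY 4
#define DZ 4

__global__ void fill3d(float grid[DX][DY][DZ])
{
    int x = threadIdx.x;
    int y = threadIdx.y;
    int z = threadIdx.z;
    grid[x][y][z] = x + y + z; // each thread writes exactly one cell
}

// Launched with, for example: fill3d<<<1, dim3(DX, DY, DZ)>>>(d_grid);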
The threadsPerBlock(N, N) declaration uses the constructor of dim3, a special type CUDA provides for three-dimensional sizes; it takes up to three arguments for the x, y and z dimensions respectively, and any dimension you leave out defaults to 1. The numBlocks variable, meanwhile, was declared as a plain int that says how many blocks you want to run the code in. In this case we specify that the whole addition operation runs in a single block. If we wanted to, we could increase this value and spread the computation across more blocks, but then the CUDA code in question has to use the index of each block in order to actually make use of the multiple blocks available.
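In other words, these declarations describe the same launch shape (a small illustration of dim3's defaults):

dim3 threadsPerBlock(N, N);        // z defaults to 1
dim3 alsoThreadsPerBlock(N, N, 1); // the same shape spelled out in full
int  numBlocks = 1;                // a plain int n is accepted and treated as dim3(n, 1, 1)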
Now, block counts can be specified in multiple dimensions the same way thread counts can, using the same dim3 type as threadsPerBlock (here only x- and y-sizes are used, though current GPUs also accept a z-dimension for the grid). It would be more accurate at this point to call numBlocks blocksPerGrid, because this two-dimensional size is the number of blocks that will run in the grid (a grid is just a grouping of blocks).
The triple angle bracket syntax <<<blocksPerGrid, threadsPerBlock>>> is placed between the function name and its argument list to specify the block and thread counts the kernel will run with.
The following code shows how you'd specify block dimensions to work on large problem sizes.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    /*
     * Note the use of `blockIdx` to get this block's index
     * within the grid, as opposed to `threadIdx` for the
     * thread's index within its block. We also have `blockDim`
     * available to give us the dimensions of each block.
     */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
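One detail worth noting: the grid size above assumes N is a multiple of 16. The bounds check in the kernel (if (i < N && j < N)) is there so the code stays correct when it isn't, in which case you would round the block count up instead. A sketch of the usual idiom:

dim3 threadsPerBlock(16, 16);
// Ceiling division: launch enough blocks to cover every element in each dimension,
// even when N is not a multiple of 16; surplus threads fail the bounds check and do nothing.
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);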
Epilogue
This week (a day late, I know; there were disruptions in my network) we took a look at the thread and block model that CUDA uses, and at the differences between host code, device code and the so-called "global" functions that serve as the entry point from the host into the device. I have more CUDA-related material in store for next week, so stay tuned.