In this final part of the CUDA overview series I will go over the different types of device memory, which are selected with declaration qualifiers, much as C++ storage-class specifiers control where and how long a variable lives.
Shared memory can be accessed faster than global memory and is particularly suited to large local variables that would otherwise have to spill into device memory on the GPU. Shared memory doesn't reside there; rather, it resides on the GPU chip itself and consists of a series of equal-sized banks that can be accessed simultaneously.
The GPU maps shared memory addresses across the banks so that as many simultaneous accesses as possible land in separate banks, which allows it to multiply throughput by the number of banks being accessed at the same time.
If two shared memory accesses land on the same memory bank at the same time, the result is a bank conflict (sometimes called a 2-way bank conflict): one of the accesses must be serviced first, so code that depends on the other has to wait, reducing throughput. Three accesses to the same bank cause a 3-way bank conflict, and so on.
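As a concrete sketch of avoiding bank conflicts, here is the classic shared-memory matrix transpose with a padded tile. The kernel name and dimensions are my own for illustration, and it assumes a square matrix whose side is a multiple of 32, launched with 32×32 thread blocks:

```cuda
__global__ void transposeTile(const float *in, float *out)
{
    // Without the +1, every element of a tile column would sit in the
    // same bank, and the column reads below would be a 32-way conflict.
    // Padding each row by one float staggers the rows across banks.
    __shared__ float tile[32][32 + 1];

    int width = gridDim.x * 32;               // assumes a square matrix
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();                          // whole tile must be loaded

    // Write the transposed tile: rows out, columns in.
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```

The `tile[threadIdx.x][threadIdx.y]` read walks down a column of the tile, which is exactly the access pattern the padding makes conflict-free.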
Shared memory is defined by using the `__shared__` qualifier, which creates one object shared by all the threads in a block, with a separate copy of the object for each block. Its lifetime is the lifetime of the block.
The address of a `__shared__` variable is subject to change and should not be treated as constant, since the on-chip memory backing it is reassigned as blocks come and go. Shared variables also cannot have an initializer, but it is fine to assign to them later on.
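A minimal sketch of declaring and using a shared variable (the kernel name is mine; it assumes it is launched with 256 threads per block):

```cuda
__global__ void reverseChunk(int *data)
{
    __shared__ int buf[256];   // no initializer allowed here

    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    buf[t] = data[base + t];   // assigning later is fine
    __syncthreads();           // wait until the whole chunk is loaded

    data[base + t] = buf[blockDim.x - 1 - t];
}
```

Each block gets its own copy of `buf`, so the blocks can reverse their chunks independently.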
Variables declared with the `__managed__` qualifier are managed by the CUDA runtime in such a way that they can be referenced from both host and device code without errors. One implication is that at global scope you cannot take a pointer or C++ reference to a managed variable, since its address might become invalid later. However, taking its pointer or reference within a function's local scope is OK.
Also, just like CUDA runtime functions, you cannot use managed variables outside the lifetime of the program's `main()` function: they are managed by the CUDA runtime, so their values are undefined when the runtime doesn't exist. Managed variables are also shared among GPUs; one copy of each managed variable is visible to all the GPUs in the system. Note that `__managed__` cannot be combined with either the `__shared__` or the `__constant__` qualifier.
You also cannot use managed variables at all in constructors, destructors, or as the values of C++ template variables.
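A minimal sketch of a managed variable (the names are mine): the same `counter` is visible to host and device code without any explicit copies.

```cuda
#include <cstdio>

__device__ __managed__ int counter = 0;

__global__ void bump()
{
    atomicAdd(&counter, 1);   // every thread increments the one shared copy
}

int main()
{
    bump<<<4, 64>>>();
    cudaDeviceSynchronize();  // make the device's writes visible to the host
    printf("%d\n", counter);  // the host reads the very same variable
    return 0;
}
```

Without `__managed__`, moving `counter` back to the host would require a `cudaMemcpy` after the kernel; here the runtime migrates it for you.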
For all practical purposes, using `__constant__` when defining a variable puts it in the constant memory space of the GPU instead of the global memory space. It is otherwise equivalent to a variable in global memory, except that constant memory is read-only from device code.
`cudaMemcpyFromSymbol(void *dst, const void *symbol, size_t count)` and `cudaMemcpyToSymbol(const void *symbol, const void *src, size_t count)` can be used to copy variables between host and device memory (both also take optional offset and direction arguments). You can likewise get the address and size of any device variable using `cudaGetSymbolAddress()` and `cudaGetSymbolSize()`, respectively. This works whether the variable resides in the global or constant space. However, these functions are provided by the CUDA runtime and can only be called from host code; calling them from device code causes undefined behavior.
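Putting the constant-memory pieces together, here is a sketch (all names are mine) that fills a `__constant__` lookup table from the host, queries its size, and copies it back:

```cuda
#include <cstdio>

__constant__ float coeffs[4];

__global__ void applyCoeffs(float *out)
{
    // Device code may only read constant memory, never write it.
    out[threadIdx.x] = coeffs[threadIdx.x % 4];
}

int main()
{
    float host[4] = {1.0f, 2.0f, 4.0f, 8.0f};
    cudaMemcpyToSymbol(coeffs, host, sizeof(host));    // host -> constant

    size_t bytes = 0;
    cudaGetSymbolSize(&bytes, coeffs);                 // host-only query
    printf("coeffs occupies %zu bytes\n", bytes);      // sizeof(coeffs)

    float back[4];
    cudaMemcpyFromSymbol(back, coeffs, sizeof(back));  // constant -> host
    return 0;
}
```

Note that the symbol itself (`coeffs`) is passed to these calls, not a pointer obtained on the host.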
We have reached the end of the Overview of CUDA series. In the future I would like to cover advanced topics such as performance optimization and the highlights of each compute capability, but I will save that for another week.
If you see any errors in this post please let me know so I can correct them.