An overview of CUDA, part 3: Memory alignment

#cpp #cuda #programming

Hello and welcome back to my CUDA series. This week let's talk about aligning device memory in CUDA, a concept which is frequently misunderstood and the cause of slow code.

CUDA memory alignment

GPU memory is accessed in groups of 32 bits, 64 bits, and 128 bits (4 bytes, 8 bytes and 16 bytes respectively). If the size of the data being moved around is not a multiple of one of these values, say if you were moving a 1, 2 or 3 character string, then that array is going to be padded with an appropriate number of null values to round it up to one of those multiples. The CUDA runtime tries to make as few memory accesses as possible because more memory accesses reduce the number of moving and copying instructions that can occur at once (the throughput).

So effeftively, when array pointers are not aligned, memory accesses could be slower. In the case of 64- and 128- bit alignment, it may even produce the wrong results.

You might ask yourself, how is that possible for arrays since pointers returned by cudaMalloc() are aligned? Well let's take a look at an example. Suppose you allocate space for one large in-memory array of structures or even other unaligned-sized elements, and you increment the pointer to return a handle to additional array elements. Those pointers will not be aligned so accessing them will reduce throughput. This can be mitigated by rounding up the allocated element size by the nearest multiple of 16, 8 or 4 bytes.

Also, arrays of structures whose totsl size of all members aren't aligned to a multiple of one of these sizes will also suffer from unaligned access slowness.

The __align__(n) qualifier can be used before structs and classes to force alignment of all members to n bits. For CUDA's purposes, n should be some multiple of 4 e.g 4, 8, 12, 16, ... with 16-byte alignment being preferred if your problem size is not too big. For example:

struct __align__(16) {
    float x;
    float y;
    float z;
};

Addresses of variables declared globally (within a CUDA file):

   int my_array { 1, 2, 3, 4};

or memory allocated with cudaMalloc() is always aligned to a 32-byte or 256-bit boundary, but it may for example be aligned to a larger boundary such as 512-bit or 1024-bit.

Some local variables defined in functions would use too many GPU registers and thus are stored in memory as well. Examples are very large arrays and structures defined locally. These variables also must use aligned memory accesses or they will be subject to the same latency penalties as global memory.

Accodring to the CUDA Programmer's Manual:

Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable).

To my understanding, this means if you have an array organized like this:

         int a[6]
    Thread
       0     1       2
    +-----+-----+
a[0]|  1  |  1  |
    +-----+-----+
a[1]|  2  |  2  |    ....
    +-----+-----+
a[2]|  3  |  3  |
    +-----+-----+

         ....

each int access will be aligned by 32 bits (4 bytes), so that each thread that accesses the same array element, simultaneously, will be using an aligned access. The same can be said for structs if array elements are replaced by member variables, and array elements which are not a multiple of 32 bits wide, such as char or a structure that has some int and a char in it for example, will be padded until they are, enabling them to have aligned accesses as well.

My guess is that aligning local memory-bound variables allows 8 128-bit memory accesses to be made internally to access all of the 32 copies of a 4-byte element for the 32 threads in a warp, keeping in mind that threads in a block cannot under any circumstances share the same copies of local variables, the topic of the next subject below.

Can threads in different blocks access variables in other blocks?

The answer is no. If a thread in a different block wants to access a variable located in some other block, you should put the variable in global space and use & reference notation to get its pointer.

Synchronizing thread execution

At any point, all the threads in the same grid (that is to say, all the threads in all the blocks) can be synchronized by inserting a CUDA function called __syncthreads() anywhere within a function. __syncthreads() acts as a serialization point and ensures that all threads have finished executing the code before that function before they all proceed to the code after it. This prevents race conditions where some threads are writing into global memory that other threads are reading from.

2- and 3- dimensional dynamic arrays

The cudaMallocPitch(void **mem, size_t *pitch, size_t width, size_t height) and cudaMalloc3D(cudaPitchPtr *ptr, cudaExtent extent) functions allow you to conveniently and safely allocate 2- and 3- dimensional arrays respectively, which meet the alignment requiremenrs of in-memory CUDA variables. In the later funciton cudaMalloc3D(), the cudaExtent type is the one returned from calling make_cudaExtent(size_t width, size_t height, size_t depth). The array and pitch in this case are located in ptr->ptr and ptr->pitch respectively. The width, height and depth indices can be used along with the stride to access the array using a[x + y*pitch + z*pitch*height], where pitch is the value returned by cudaMallocPitch() and cudaMalloc3D(). This x value correponds to the width, such that incrementing the index by 1 position advances the array by one width unit, the y value correponds to the height, such that incrementing the index by pitch positions advances the array by one height unit, and if a 3D array was allocated, z corresponds to the depth and incrementing the index by pitch*height positions advances the array by one depth unit.

This indexing scheme is easy to memorize and allows you to do indexing with pointers, which, besides being returned and passed by more functions, are equivalent to using arrays. The equivalent indexing scheme using pointer arithmetic with a pointer a would be (a + x + y*pitch + z*pitch*height). Of course, using the pitch in this way ensures that memory accesses are aligned.

Conclusion

Now you have learned about how CUDA aligns GPU memory accesses, how to safely allocate two-and three-dimensional arrays and how to synchronize thread execution to protect from multiple threads reading and writing into the same shares memory location at the same time.

If you see any errors in this post, please let me know so I can correct them.