A quick programmer’s look at NVIDIA’s CUDA

Posted on 2007-02-22 by Bryan O'Sullivan

I spent a while this evening reading through the documentation for the beta release of NVIDIA’s CUDA GPGPU system. My motivation for this was that nvcc, the CUDA compiler, is based on a code drop of the EkoPath compiler, which I’ve worked on intermittently over the past few years.

The programming model that these GPUs enforce is incredibly complex. It’s more than a little reminiscent of the Connection Machine (for a blast from the past, see a collection of scanned CM-5 documentation.

The idea is that the GPU executes a “kernel” of compute-intensive, highly parallelisable, code on behalf of the CPU. Data is transferred to the GPU when a kernel starts to execute, and back to the host when it completes. The GPU may execute multiple kernels simultaneously, if it is capable of it.

A kernel is executed as a collection of threads, with threads organised into “blocks”. The blocks are further organised into “grids”. Each thread has thread-local storage, and a block of threads shares a chunk read-write memory. This is how they communicate and synchronise with each other. Threads can also access the GPU’s global memory, although blocks of threads cannot synchronise with each other via global memory.

Global memory has a pile of constraints. There’s not one single kind of global memory; instead, there are three. Unlike the thread-local and block-shared memories, global memories can be accessed by the host (although accesses are expensive).

Plain old read-write memory. Accesses to this memory are not cached. When a thread block is scheduled to execute, and accesses global memory, its performance is heavily penalised if threads don’t access global memory in a suitable pattern (see section 6.1.2.1 of the Programming Guide).
“Constant” memory is read-only, and reads are cached. Again, there are restrictions on access patterns to constant memory that affect performance.
“Texture” memory is also read-only, and cached. Cached texture data is laid out to give best performance for 2D access patterns.

Block-shared memory isn’t immune to peculiar performance constraints, either. It’s organised into “banks”, to allow parallel access to multiple banks in a single operation. In other words, n threads can access n banks in one cycle. However, concurrent accesses to a single bank are forcibly serialised, so n threads hitting a single bank will take n cycles.

So at every level of the GPU memory hierarchy, there are multiple factors that you have to keep in mind if you want to achieve good performance.

The extensions that NVIDIA has made to the C language are fairly minimal. There are some new keywords to control layout of data, to determine where variables should live in the memory hierarchy. Not surprisingly, there are new vector types that look very much like OpenGL vector types.

A few similar keywords control function layout. Functions can be marked as device accessible only, resident on the device but called from the host (i.e. a kernel), or host-only. A weird piece of syntax `<<< Dg, Db, Ns >>>’ is required when calling a kernel, to control the execution parameters for the kernel: grid size, block size (and number of threads), and memory allocation.

With hardware of this complexity, a good optimising compiler can make a substantial difference to application performance. The heritage of NVCC is unimpeachable. It’s based on the EkoPath compiler, which has been the most frequent compiler used for SPEC benchmark submissions by AMD64 hardware vendors for several years. The EkoPath compiler was in turn based on Open64, a compiler that SGI GPLed when it dropped out of the compiler business.

Among the memory hierarchy related optimisations in the Open64 compiler family that are available to NVCC are the following (this is just a short list of highlights; the real number is big). I don’t know how many of these are enabled in NVCC, nor do I know enough about individual cache sizes or miss penalties to have a clue as to which ones are likely to be effective.

Loop nest optimisation. For a set of nested loops, this can change the order in which the inner and outer loops are executed, to improve the pattern of access to memory.
Vectorised intrinsics. If application code is, for example, computing sin(x[i]) for all i in a vector, the compiler can replace this with a single call to a highly optimised sin specialised for vectors.
Cache blocking. Replace a single loop over a large vector with smaller loops that operate on cache-sized chunks of the vector.

Given the degree of manual control over variable placement that NVIDIA’s C extensions seem to enforce, it’s not clear to me that their compiler team has had a chance to automate any of the transfer of data between levels of the memory hierarchy yet.

I also find it telling that the programmer’s guide includes specific guidelines on how to avoid bank conflicts in block-shared memory, where in at least some of these cases it’s clear that the compiler could be automating the job.

It’s worth reading the programmer’s guide in its entirety to get a sense of just how complex CUDA is, and how many different constraints the determined application programmer will have to keep in mind at a time. There’s not a lot of abstraction going on here (vendor-provided BLAS and FFT libraries notwithstanding).

We have plenty of previous examples of hardware that failed to live up to their early marketing promise, from the i860 to the PS3. CUDA looks set to follow in their footsteps: I expect that it will take vast amounts of work for programmers to get halfway decent performance out of a CUDA application, and that few will achieve more than 10% of theoretical peak performance.

People with the expertise, persistence, and bloody-mindedness to keep slogging away will undoubtedly see phenomenal speedups for some application kernels. I’m sure that the DOE and NSA, in particular, are drooling over this stuff, as are the quants on Wall Street. But those groups have a tolerance for pain that is fairly unique. This technology is a long way from anything like true accessibility, even to those already versed with parallel programming using environments like MPI or OpenMP. Still, it’s a great first step.

“Now, imagine a Beowulf cluster of these!”

Posted in hardware, software

13 comments on “A quick programmer’s look at NVIDIA’s CUDA”

Happy FFTer says:

2007-02-22 at 05:38

Am I to assume then that the provided BLAS & FFT libraries abstract out all the necessary memory optimizations that someone performing their own calculations would have to consider? I’ve only read the provided CUFFT documentation and it seems that if your interface mimics fftw it would provide considerable speedup (though as you said, on the order of which no one is sure).
Dan P says:

2007-02-22 at 09:31

I needed a summary and yours is at the perfect level of granularity for what I want. Thanks!
Arun Demeure says:

2007-02-23 at 11:59

Let’s just say I disagree completely! 🙂 Of course, having more experience with GPU architecture is helping me here, but my sincere belief is that CUDA is ridiculously simple, and that you can very easily achieve high efficiency with it.

I know the docs are overcomplicating, so here’s a basic summary that hopefully makes it evident how simple it can be…

You can launch one or multiple kernels to the GPU, and each kernel has to be divided in blocks of 64 to 256 threads for optimal performance. Shared memory, along with a basic synchronization primitive, lets the threads in a single block communicate with each other. No communication between blocks is easily possible.

There practically is only one scalar ALU per “processor” and it can handle all of the chip’s math ops. There is nothing to be gained from vector operations. Fast and precise approximations of sin/cos/log/exp and others exist on a separate ALU.

IMO, there are only three significant factors that you need to keep in mind to write efficient CUDA code:
– Shared memory bank conflicts. Hard for some workloads, but easy to handle for most of them, afaics.
– Branch coherence; all paths executed by one thread in a group of 32 will have to be executed by the 31 other ones.
– Put read-only 1D data in constants, and read-only 2D data in textures. Not a substantial gain, but it’s there, I’m sure.

The one thing I disagree the most with, however, is that you claim achieving high efficiency will be basically impossible. You’d be surprised how easy it is to get excellent efficiency compared to most other processors out there. Keep this in mind:
– All operations are scalar, and are executed by the same ALU, excluding the approximated sin/cos/etc.
– The way kernels and blocks are exposed makes for a very efficient programming model once you get your head around it.
– Cache misses and such are much less expensive than on CPUs, because GPUs are made with latency tolerance in mind. The justification behind caching is bandwidth, not latency.

So, what you get is very high performance/mm2 and very high efficiency. Obviously, it is NOT suitable for all workloads. The idea is that it’s fundamentally a SIMD machine with extra capabilities, including shared memory.

I firmly agree that a summarized version of the docs would have been a great help for everyone who wants to get involved with CUDA. I’d argue that once you have a firm notion of the concept though, it’s really quite simple.

You also shouldn’t try to overoptimize, except for “obvious” things like bank conflicts. You’d be surprised how high your efficiency can be for naive implementations that don’t use caching at all, and only shared memory for some basic stuff. Of course, choosing the right algorithm for the architecture remains very important, but that’s also the case on CPUs. And it’s arguably not much harder to do so, either.
Wes Felter says:

2007-02-26 at 12:26

Have you looked at Stanford’s Sequoia? It looks much more convenient than programming CUDA or Cell by hand.
Bryan O'Sullivan says:

2007-02-26 at 12:57

Wes –

Yes, I know Pat Hanrahan, and was aware of Sequoia. It’s a nice piece of work.
Stream Programmer says:

2007-02-27 at 16:58

Has anyone looked at PeakStream’s API? They too run on GPU’s, and have a far simpler programming model.
Ian Ameline says:

2007-02-28 at 08:57

Peakstream’s API looks much cleaner, but currently they completely take over the gpu — it cannot be used to attach a display — so if you are doing 3D, you need a second GPU. That pretty much rules them out for me — at least until they fix that limitation.

Another one to look at is RapidMinds — it builds on McCool’s earlier work.

CUDAs great advantage in my mind is its ability to send the results of its computations directly into the gfx rendering pipeline.

The exposing of the banked shared memory to the programming model, is IMHO not a good idea — but from their standpoint it might not be such a bad one — it does tie CUDA pretty tightly to their architecture.

People interested in data parallel programming should also look closely at Intel’s Thread Building Blocks (TBB) library.
John Stone says:

2007-03-01 at 14:22

While some of the criticisms in this article are valid, I’ve found writing CUDA programs no more difficult than convincing vectorizing/SSE-capable compilers for regular CPUs to do something useful. For anyone that’s accustomed to the rigors of parallel programming with threads or message passing, I don’t think that CUDA presents any special challenges. In performance oriented multithreaded code one has to worry about hot spotting, false sharing, and lots of other “implicit” issues which relate to the memory system and program design. I actually find it much easier to deal with these things explicitly, as you have no doubts whatsoever that bank conflicts will negatively impact your performance. I think that the explicit exposure of the memory system makes it a lot easier to leverage for high performance coding. One thing people should keep in mind is that the hardware is what it is. Some algorithms just aren’t going to run well on GPUs, and so some of the “pain” people talk about may be the natural result of attempting to run inappropriate algorithms on hardware that’s ill suited to them. It’s probably much better to start with a clean slate than to immediately take your favorite C code and try and hack it into a CUDA program. I’ve found it best to accept the hardware as it is and use it efficiently for the tasks it’s ideally suited. I strongly recommend that new CUDA programmers read the NVIDIA documentation cover to cover, more than once, before bothering to write their first programs. Going into GPU programming without first doing the necessary background reading is rather like attempting to write thread-safe programs without knowing what mutex locks and condition variables are, IMHO. I think people will find CuDA and GPU programming in general harder to learn than other paradigms simply because they’ll find that much of what they’ve come to _assume_ is true about the performance and architecture of the underlying machine is false when running on a GPU. It’s sometimes hard to swallow the idea that on a GPU you’re often better off doing redundant calculations than adding in lots of branching to avoid work, or that it might be better for independent threads to duplicate effort rather than adding in lots of barrier synchronizations or collective operations to exchange partial results, etc. Once you get used to what the GPU _likes_ to do, writing code for it is much simpler.

Cheers,
John
jrk says:

2007-03-01 at 15:41

I completely agree with Arun.

I’d also like to respond to the undercurrent throughout this discussion which came to the surface in Ian Ameline’s comment:

“The exposing of the banked shared memory to the programming model, is IMHO not a good idea â€” but from their standpoint it might not be such a bad one â€” it does tie CUDA pretty tightly to their architecture.”

Every fast/parallel machine today has a complex memory hierarchy, and leveraging it effectively is critical to not only implementing but designing efficient algorithms. After several years of struggling with a heavily abstracted, hidden, and very complex memory hierarchy, the GPGPU community has realized that explicit knowledge and moderate control of the memory hierarchy is *critical* to achieving high performance. Working against a driver that hides all the complexity under the hood is actually MUCH HARDER than simply making it explicit. CUDA not only makes it clearer to programmers when they’re going to fall off cliffs (because they very much do exist, whether or not they’re exposed in the programming model), it gives them vastly more powerful tools to simply and explicitly determine where they want to be with respect to those cliffs.

Sure, it would be nice if we could build machines which ran code efficiently with no concern for the intricacies of the memory system, but that simply isn’t true — *especially* not at the high-performance and massively parallel end of the application and hardware spectrum.

(Also cf. Sequoia, as others have mentioned.)
Per says:

2007-06-01 at 12:37

Researching CUDA this summer (I have just started doing so with a team at Augustana University) and having no prior experience with GPU manipulation this provides a great springboard into branching math intensive portions of code into the GPU. From reading the manual, I have to agree with everyone’s opinion on where the difficulty lies.

Memory management/hierarchy.

I would like to point out though, that I don’t think that efficiency will be “unlikely” as Bryan has stated. All I’m seeing is code that allows “close to metal(Not a fan of the ATI CTM code)” code with a decent clarity of what portions of the GPU need to be addressed to accomplish the multi-threaded tasks. And as said before, I think the biggest problem is changing the logic of programming for a CPU to programming for a GPU.
Sarnath says:

2007-10-15 at 06:57

CUDA rocks!
Oded Kuznik says:

2007-11-27 at 20:22

Hi Arun,

You mentioned that i can launch several kernels concurrently.
Can i do it on the same GPU?

Thanks
Arno says:

2009-08-07 at 08:06

I have production level code, that can , with very minor modifications, be run both on the CPU and the GPU using CUDA. Just one more difference: the code runs up to 350x faster on a dual GTX280 than it does on 4 cores of an Intel Xeon at 2.8 GHz. In both cases threading is used, with the same thread library and both graphics cards and all 4 cores run at 100% load during execution. I have found CUDA to be easier to deal with than PVM or MPI.

5 Pings/Trackbacks for "A quick programmer’s look at NVIDIA’s CUDA"

Thinking Parallel » Blog Archive » A beautiful summary of NVIDIA’s CUDA by Bryan O’Sullivan says:

2007-02-23 at 05:05

[…] O’Sullivan has a beautiful summary of the present state of NVIDIA’s CUDA. He explains the programming model, along with the many different levels of memory and their […]
A more in depth look at nVidia’s CUDA SDK | insideHPC says:

2007-02-23 at 09:18

[…] O’Sullivan has a detailed look into the new SDK nVidia has announced that’s aimed at making programing the GPU in their […]
poundbang - Harish Mallipeddi's Blog » Interesting links from delicious says:

2007-02-23 at 12:50

[…] teideal glic deisbhÃ©alach Â» Blog Archive Â» A quick programmerâ€™s look at NVIDIAâ€™s CUDANvidia is really considering GPGPU seriously! […]
ÐÐ»Ñ‘Ð½Ð° C++: NVidia CUDA Ð¸ ATI CTM says:

2007-03-02 at 10:30

[…] Ð¸ Ð¼Ð½ÐµÐ½Ð¸Ñ Ð¿Ñ€Ð¾ CUDA:NVIDIA CUDA Quick SummaryNVIDIA CUDA IntroductionA quick programmerâ€™s look at NVIDIAâ€™s CUDANvidia releases Cuda – and reinvents Stream Processing?G80 Architecture from CUDA – […]
bart.kevelham.com » A different sound on GPGPU says:

2007-11-13 at 14:34

[…] http://www.serpentine.com/blog/2007/02/22/a-quick-programmers-look-at-nvidias-cuda/ One of the better reviews I’ve read. A must read!! “People with the expertise, persistence, and bloody-mindedness to keep slogging away will undoubtedly see phenomenal speedups for some application kernels.” That must be me I guess… […]