Comments on: A quick programmer’s look at NVIDIA’s CUDA

By: Arno

Arno — Fri, 07 Aug 2009 12:06:36 +0000

I have production-level code that can, with very minor modifications, be run on both the CPU and the GPU using CUDA. Just one more difference: the code runs up to 350x faster on a dual GTX280 setup than it does on 4 cores of an Intel Xeon at 2.8 GHz. In both cases threading is used, with the same thread library, and both graphics cards and all 4 cores run at 100% load during execution. I have found CUDA to be easier to deal with than PVM or MPI.

By: Oded Kuznik

Oded Kuznik — Wed, 28 Nov 2007 04:22:30 +0000

Hi Arun,

You mentioned that I can launch several kernels concurrently.
Can I do it on the same GPU?

Thanks

By: bart.kevelham.com » A different sound on GPGPU

bart.kevelham.com » A different sound on GPGPU — Tue, 13 Nov 2007 22:34:58 +0000

[…] http://www.serpentine.com/blog/2007/02/22/a-quick-programmers-look-at-nvidias-cuda/ One of the better reviews I’ve read. A must read!! “People with the expertise, persistence, and bloody-mindedness to keep slogging away will undoubtedly see phenomenal speedups for some application kernels.” That must be me I guess… […]

By: Sarnath

Sarnath — Mon, 15 Oct 2007 13:57:41 +0000

CUDA rocks!

By: Per

Per — Fri, 01 Jun 2007 19:37:00 +0000

I’m researching CUDA this summer (I have just started doing so with a team at Augustana University), and since I have no prior experience with GPU manipulation, this provides a great springboard into moving math-intensive portions of code onto the GPU. From reading the manual, I have to agree with everyone’s opinion on where the difficulty lies.

Memory management/hierarchy.

I would like to point out, though, that I don’t think efficiency will be “unlikely”, as Bryan has stated. All I’m seeing is code that allows “close to metal” programming (not that I’m a fan of the ATI CTM code) with decent clarity about which portions of the GPU need to be addressed to accomplish the multi-threaded tasks. And as said before, I think the biggest problem is changing one’s logic from programming for a CPU to programming for a GPU.

By: Алёна C++: NVidia CUDA and ATI CTM

Алёна C++: NVidia CUDA and ATI CTM — Fri, 02 Mar 2007 18:30:13 +0000

[…] and opinions about CUDA: NVIDIA CUDA Quick Summary; NVIDIA CUDA Introduction; A quick programmer’s look at NVIDIA’s CUDA; Nvidia releases Cuda – and reinvents Stream Processing?; G80 Architecture from CUDA – […]

By: jrk

jrk — Thu, 01 Mar 2007 23:41:57 +0000

I completely agree with Arun.

I’d also like to respond to the undercurrent throughout this discussion which came to the surface in Ian Ameline’s comment:

“The exposing of the banked shared memory to the programming model, is IMHO not a good idea — but from their standpoint it might not be such a bad one — it does tie CUDA pretty tightly to their architecture.”

Every fast/parallel machine today has a complex memory hierarchy, and leveraging it effectively is critical to not only implementing but designing efficient algorithms. After several years of struggling with a heavily abstracted, hidden, and very complex memory hierarchy, the GPGPU community has realized that explicit knowledge and moderate control of the memory hierarchy is *critical* to achieving high performance. Working against a driver that hides all the complexity under the hood is actually MUCH HARDER than simply making it explicit. CUDA not only makes it clearer to programmers when they’re going to fall off cliffs (because they very much do exist, whether or not they’re exposed in the programming model), it gives them vastly more powerful tools to simply and explicitly determine where they want to be with respect to those cliffs.
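One such cliff, shared-memory bank conflicts, can be sketched with a toy model in plain Python (not CUDA code). It assumes G80-style shared memory: 16 banks of 32-bit words, with accesses serviced one half-warp of 16 threads at a time. The function name `conflict_degree` and its parameters are made up for this illustration, and the hardware broadcast case (every thread reading the same word) is deliberately left out.

```python
from collections import Counter

NUM_BANKS = 16   # assumption: G80-class shared memory, 16 banks of 32-bit words
HALF_WARP = 16   # assumption: accesses are serviced one half-warp at a time


def conflict_degree(stride_words, threads=HALF_WARP, banks=NUM_BANKS):
    """Return the largest number of threads that land on the same bank.

    Thread t reads the 32-bit word at index t * stride_words, and word i
    lives in bank i % banks. A result of 1 means no conflict; a result of
    n means the access is serialized into n passes. Broadcast reads
    (stride 0) are not modeled.
    """
    hits = Counter((t * stride_words) % banks for t in range(threads))
    return max(hits.values())


if __name__ == "__main__":
    for stride in (1, 2, 3, 4, 8, 16, 17):
        print(f"stride {stride:2d}: {conflict_degree(stride)}-way")
```

In this model a stride of 16 words sends every thread to the same bank, while a stride of 17 spreads them across all 16 banks again. That is the reasoning behind the common trick of padding a shared array's row length by one word, and it only becomes something a programmer can act on once the banks are visible in the programming model.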

Sure, it would be nice if we could build machines which ran code efficiently with no concern for the intricacies of the memory system, but that simply isn’t possible, *especially* not at the high-performance and massively parallel end of the application and hardware spectrum.

(Also cf. Sequoia, as others have mentioned.)

By: John Stone

John Stone — Thu, 01 Mar 2007 22:22:26 +0000

While some of the criticisms in this article are valid, I’ve found writing CUDA programs no more difficult than convincing vectorizing/SSE-capable compilers for regular CPUs to do something useful. For anyone who’s accustomed to the rigors of parallel programming with threads or message passing, I don’t think that CUDA presents any special challenges. In performance-oriented multithreaded code one has to worry about hot spotting, false sharing, and lots of other “implicit” issues which relate to the memory system and program design. I actually find it much easier to deal with these things explicitly, as you have no doubts whatsoever that bank conflicts will negatively impact your performance. I think that the explicit exposure of the memory system makes it a lot easier to leverage for high performance coding.

One thing people should keep in mind is that the hardware is what it is. Some algorithms just aren’t going to run well on GPUs, and so some of the “pain” people talk about may be the natural result of attempting to run inappropriate algorithms on hardware that’s ill suited to them. It’s probably much better to start with a clean slate than to immediately take your favorite C code and try to hack it into a CUDA program. I’ve found it best to accept the hardware as it is and use it efficiently for the tasks it’s ideally suited to.

I strongly recommend that new CUDA programmers read the NVIDIA documentation cover to cover, more than once, before bothering to write their first programs. Going into GPU programming without first doing the necessary background reading is rather like attempting to write thread-safe programs without knowing what mutex locks and condition variables are, IMHO. I think people will find CUDA and GPU programming in general harder to learn than other paradigms simply because they’ll find that much of what they’ve come to _assume_ is true about the performance and architecture of the underlying machine is false when running on a GPU.

It’s sometimes hard to swallow the idea that on a GPU you’re often better off doing redundant calculations than adding in lots of branching to avoid work, or that it might be better for independent threads to duplicate effort rather than adding in lots of barrier synchronizations or collective operations to exchange partial results, etc. Once you get used to what the GPU _likes_ to do, writing code for it is much simpler.
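The point about redundant work versus branching can be sketched with a very rough cost model in plain Python (not CUDA code). It assumes only that the threads of a warp share one instruction stream, so a warp pays for every branch that at least one of its threads takes. The function `warp_cost`, the branch labels, and the cycle counts are all hypothetical numbers chosen for illustration, not measurements.

```python
WARP_SIZE = 32  # threads that execute in lockstep on NVIDIA hardware


def warp_cost(thread_paths, path_cost):
    """Cycles for one warp under a simple lockstep (SIMT) model.

    thread_paths: the branch label each thread in the warp takes.
    path_cost: mapping from branch label to cycles for that branch.
    Divergent branches are executed one after another, so the warp pays
    the sum of the costs of all distinct branches taken by its threads.
    """
    return sum(path_cost[p] for p in set(thread_paths))


# Hypothetical costs: the real computation versus an early-out that skips it.
COST = {"work": 100, "skip": 2}

if __name__ == "__main__":
    all_work = ["work"] * WARP_SIZE
    one_works = ["work"] + ["skip"] * (WARP_SIZE - 1)
    all_skip = ["skip"] * WARP_SIZE
    for name, paths in [("all work", all_work),
                        ("1 works, 31 skip", one_works),
                        ("all skip", all_skip)]:
        print(f"{name:18s} {warp_cost(paths, COST)} cycles")
```

Under these assumptions the early-out only pays off when every thread in the warp takes it. A single thread that still needs the work makes the warp cost slightly more than having all 32 threads compute it unconditionally, which is one way to read the advice above about preferring redundant calculation over branching.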

Cheers,
John

By: Ian Ameline

Ian Ameline — Wed, 28 Feb 2007 16:57:20 +0000

PeakStream’s API looks much cleaner, but currently they completely take over the GPU: it cannot also be used to drive a display, so if you are doing 3D, you need a second GPU. That pretty much rules them out for me, at least until they fix that limitation.

Another one to look at is RapidMind; it builds on McCool’s earlier work.

CUDA’s great advantage in my mind is its ability to send the results of its computations directly into the graphics rendering pipeline.

The exposing of the banked shared memory to the programming model, is IMHO not a good idea — but from their standpoint it might not be such a bad one — it does tie CUDA pretty tightly to their architecture.

People interested in data parallel programming should also look closely at Intel’s Threading Building Blocks (TBB) library.

By: Stream Programmer

Stream Programmer — Wed, 28 Feb 2007 00:58:53 +0000

Has anyone looked at PeakStream’s API? They too run on GPUs, and have a far simpler programming model.