The taste of CUDA
Recently I have been working on some projects using CUDA. Perhaps I will write more about this experience later; today, just a few things worth noting:
– CUDA is quite low level. If you want to test your algorithms fast, do learn some wrapper library. I tried Thrust, and it is an excellent library. (I'm speaking here about the runtime API; you really shouldn't touch the driver API unless you have to.)
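To give a flavor of what a wrapper like Thrust buys you, here is a minimal sketch (assuming the Thrust headers that ship with the CUDA toolkit): a GPU sort plus reduction, with allocation and host/device copies handled by `device_vector`, that would take pages of raw runtime-API code.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main() {
    // Fill a host vector with some unsorted data.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i)
        h[i] = (h.size() - i) % 1000;

    // One assignment performs the host-to-device copy.
    thrust::device_vector<int> d = h;

    thrust::sort(d.begin(), d.end());             // parallel sort on the GPU
    int sum = thrust::reduce(d.begin(), d.end()); // parallel reduction

    printf("sum = %d\n", sum);
    return 0;
}
```

No `cudaMalloc`, no kernel launches, no explicit `cudaMemcpy`: the containers and algorithms hide all of it.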
– There is a lot of publicity around CUDA, especially on NVIDIA's sites. But when you want some real information, it can range from hard to impossible to find.
– You cannot file a bug report unless you are a registered developer. Becoming a registered developer is no small feat, though: judging from the application form, you have to work at some company on some CUDA product, provide a web page describing that product (!), and supply some other silly details. I still haven't received any reply to my application.
– The compiler (NVCC) has its bugs. Just recently I found two of them. One bug was nice enough to give me a compiler error. Fair enough; perhaps a couple of XORs are too much to handle. The other was more severe: it turns out that sometimes nvcc's -use_fast_math switch really doesn't help. In my case, an image that should have been identical every frame came out mostly random.
Even worse: with -use_fast_math the code actually ran about 100 times slower.
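For context on why -use_fast_math can change results at all: the switch replaces single-precision operations such as division and the transcendental functions with fast, less accurate hardware intrinsics (e.g. __fdividef, __sinf). A toy illustration of the kind of code affected (the kernel and values here are my own invention, not the code from the project above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Under -use_fast_math, nvcc compiles the division below as the fast but
// less accurate __fdividef intrinsic, so the same source can produce
// slightly different output depending on the build flags.
__global__ void ratio(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] / b[i];
}

int main() {
    const int n = 256;
    float ha[n], hb[n], hout[n];
    for (int i = 0; i < n; ++i) { ha[i] = float(i) + 1.0f; hb[i] = 3.0f; }

    float *da, *db, *dout;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dout, n * sizeof(float));
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    ratio<<<(n + 127) / 128, 128>>>(da, db, dout, n);
    cudaMemcpy(hout, dout, n * sizeof(float), cudaMemcpyDeviceToHost);

    // The low-order bits of values like this one may differ between a
    // build with -use_fast_math and one without.
    printf("out[100] = %.9f\n", hout[100]);

    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```

Losing a few bits of precision is the documented trade-off; losing determinism of an image that should be identical each frame, as happened to me, is not.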
Another thing, and I'm not sure whether this is a bug or not: compiling template-heavy debug code is a nightmare. Each target architecture could take 15 minutes and ~2 GB of memory (according to top) to compile; at least, that was my experience.
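One mitigation worth trying (a sketch; the exact sm_XX value depends on your card): restrict debug builds to the single architecture you actually test on, since every extra target multiplies the compile time and memory for template-heavy code.

```shell
# Compile for one architecture only during development,
# instead of letting nvcc build code for the whole default set.
nvcc -O2 -arch=sm_13 -o kernel_test kernel_test.cu
```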
There are plenty of things here that are either broken or work poorly. On the other hand, CUDA is free as long as you have an NVIDIA card, and it really does a terrific amount of work under the hood. In the right applications it can perform calculations dramatically faster than a CPU.
Unfortunately, it's extremely easy to shoot yourself in the foot here, sometimes so badly that you have to reboot your machine because the driver got stuck. In the long run I would expect a heavy movement towards higher-level languages, most likely with some declarative/functional flavor. (No, C++ is not functional. Boost::lambda is a bad joke, and writing a new class every time you want a functor is stupid.)
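On the shoot-yourself-in-the-foot point: the best defense I know of is checking the status of every runtime call and synchronizing right after each launch, so errors surface at the launch site instead of as a wedged driver much later. A minimal sketch of that pattern (the macro name is my own; any equivalent wrapper will do):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures are reported with file and
// line instead of silently propagating.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err));                      \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float* d;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d, 0, n * sizeof(float)));

    scale<<<(n + 255) / 256, 256>>>(d, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
    CUDA_CHECK(cudaThreadSynchronize());  // catches errors from the kernel itself

    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

This doesn't prevent a runaway kernel from hanging the driver, but it does catch the out-of-bounds launches and failed allocations that otherwise fail silently.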
I think one could come up with something like a query-execution evaluator: the programmer would specify the result value, and the compiler would then devise the best way to calculate it for a given machine configuration and data set. All the hairy low-level stuff would then go away, and I really believe it is too much to handle by hand in the long run.
So far, the people who use CUDA seem to come from backgrounds accustomed to bad code, because they need the performance badly. If we want anyone else to join in, we need better tools.