I'll be trying to use this blog to catalogue experiences in computing. At the moment, I'm departing from the well trodden paths of CUDA and starting to adventure into OpenCL. I found a few tutorials online, so I hope to use this blog (at least initially) to add to that literature, and hopefully ease someone else's passage into OpenCL as well as help to track my own progress. There may be many similarities between these notes and those of other tutorials, but I have also found significant differences between my OpenCL installation and what other tutorials suggest is possible, possibly just as a result of different code versions. Currently, the OpenCL standard is at 1.1, which specifies the library's contents, while AMD's implementation of OpenCL is at revision 2.5 (though the machine I am using has 2.4 installed). I hope that I can also add information of my own. Note that I will be approaching OpenCL as an alternative to CUDA for GPUs, so I probably won't get into programming multi-core CPUs or other architectures. To start with, here are a few tutorials that I found helpful:
Pretty detailed tutorial, a few things I haven't liked so far about how the platforms are set up but definitely very helpful
Brief introduction, only uses CPUs I think
Huge collection of tutorials from an AMD conference
OpenCL Reference Pages
That last link contains a number of PDF and video tutorials under the Sessions tab. If you're just getting started, I would absolutely recommend the tutorial 1001: "Introduction to OpenCL". It's a brief (~40 minutes) but pretty good introduction to OpenCL, particularly understandable if you already have some background in GPU computing but I think pretty straightforward otherwise as well. I should also add that some of these tutorials have conflicting function calls with eachother and with the AMD SDK samples. I'm not sure yet what the proper way to do things is yet (i.e. setup OpenCL platforms and devices) but I'll let you know as soon as I do.
To get started, this is what that video tutorial taught me (I'll try to draw analogies to CUDA wherever I can, at least as I understand it).
An AMD GPU is composed of SIMD units, each of which has n processors. This is the same as CUDA with streaming multiprocessors. I'm not sure yet how # of execution units compare between AMD and NVIDIA GPUs, though I have a vague recollection that AMD uses fewer cores (don't hold me to that).
The finest element of the OpenCL programming model is called a work element. A work element is executed by a single thread, and contains a series of instructions to execute (the kernel). It can access global memory on the GPU device, as well as a piece of what the tutorial refers to as local memory (CUDA: __shared__ memory). Work items are grouped together into work groups. A work group ensures that all items contained execute on the same SIMD on a device, and work items in the same group (i.e. on the same SIMD) can all access the same local memory but cannot access the local memory of another SIMD/work group. Barriers are possible within a work group (CUDA: __synthreads()) but not between work groups (though the tutorial hinted at atomics just like CUDA).
So, on the low-level side of things, each SIMD has N processors (I think the tutorial mentioned that 16 is a standard number). Obviously, this means you want your minimum work group size to be 16, but it should be more to help overlap work groups to hide latency. However, there is this additional concept of a wavefront in OpenCL. It seems that if you have a work group with size > N (where N is the # of processors) then what a SIMD unit will do is fetch the next instruction for this work group, and run that instruction not just for N work items in the work group but for N*M where M is some integer (in the tutorial they use the example of N=16,M=4). This means that even though you only have 16 processors in a SIMD, you sort of get 64 work items executing in lock step. I imagine this might also decrease the cost of fetching instructions as you fetch once for a greater number of work items. The tutorial doesn't explain why this concept of a wavefront is added, but it seems equivalent to a warp of threads in CUDA, even though the work items technically aren't executing in lockstep. To summarize, even though you have N processors per SIMD OpenCL will run N*M work items at once by repeating an instruction M times for N work items.
Looking at OpenCL from a higher perspective, it has this concept of platforms and devices. A platform is "The host plus a collection of devices managed by the OpenCL framework", so this could be your CPU plus a number of GPUs (devices). Using platforms+devices controls exactly which piece of hardware a command is being issued to.
Commands (i.e. copying data, launching computation, etc) are issued to OpenCL from the host via command queues (CUDA streams). These command queues are associated with an OpenCL device. From the tutorial, queues can be ordered or non-ordered, meaning that they can force the order that you place items in the queue to be the order they are executed, or not. They do provide synchronization mechanisms from the host so that even in unordered command queues you can be certain that all of the previously issued commands have completed.
I think that's enough information for a quick overview post. I would definitely recommend taking a look at the "Introduction to OpenCL" tutorial video, it was very helpful in going from nothing to some understanding of OpenCL.