Wednesday, October 19, 2011

Contexts and Command Queues

I briefly mentioned command queues in the first post, but I don't think I said anything about contexts (mostly because I didn't truly understand them).

OpenCL contexts seem to be a way of grouping a host system with a set of devices so that commands can be issued to them (basically the programmer saying "these are the pieces of hardware which I will be using through this particular command queue"). I say this because, when constructing a command queue (refer to the first post for some resources on what a command queue is, as well as a brief description), the constructor must be passed a context. To create a context we use the clCreateContext function, which associates a collection of devices on a certain platform with a single OpenCL context. Then we can create a command queue that applies to a single device in that context using the clCreateCommandQueue function.

Inserting the following code after the code from the previous post results in full initialization of a context and command queue:
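Something along these lines should do it (a sketch: it assumes the platform and device variables produced by the discovery code from the previous post, and skips error checking):

    cl_int status;

    // Tie the chosen platform to a new context containing our one device.
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0
    };
    cl_context context = clCreateContext(props, 1, &device,
                                         NULL, NULL, &status);

    // Attach a command queue to that one device within the context.
    cl_command_queue queue = clCreateCommandQueue(context, device,
                                                  0, &status);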



This code creates a single OpenCL context associated with a single OpenCL device, and then attaches a command queue to it. I haven't tried this yet, but I imagine that passing a device to clCreateCommandQueue which is not associated with the given context would produce an error (likely CL_INVALID_DEVICE).

Discovering Platforms and Devices

In OpenCL, all work must be associated with a platform+device. This is a little different from CUDA: there, you are handed a default device, and only if you go playing with cudaSetDevice or with device contexts in the driver API do you get access to more than that one device. OpenCL, on the other hand, adds some burden up front in initializing the devices, in exchange for giving the programmer more control over which platforms and devices are used, and probably more readable and safer code, since everything is stated explicitly.

The two functions that I have found which seem to be most useful in doing this are clGetPlatformIDs and clGetDeviceIDs. These functions allow you to retrieve identifiers for platforms and devices, as well as the number of platforms/devices available. To illustrate this, let's start with some sample code I just wrote:
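Here is a sketch of that code (the variable names numPlatforms and platforms are my own, chosen to match the description below, and error checking is omitted):

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main() {
        // First call: pass NULL for the output array to just count platforms.
        cl_uint numPlatforms = 0;
        clGetPlatformIDs(0, NULL, &numPlatforms);

        // Second call: fetch one identifier per platform.
        cl_platform_id *platforms =
            (cl_platform_id *)malloc(numPlatforms * sizeof(cl_platform_id));
        clGetPlatformIDs(numPlatforms, platforms, NULL);

        printf("Found %u platform(s)\n", numPlatforms);
        free(platforms);
        return 0;
    }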




This code discovers how many platforms there are, and then fetches platform identifiers for each platform into the platforms array (clGetPlatformIDs). A similar process can be done for devices on a single platform:
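Continuing the sketch (here platform is assumed to be one of the identifiers fetched above, e.g. platforms[0]):

    // First call: count the devices of any type on this platform.
    cl_uint numDevices = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);

    // Second call: fetch one identifier per device.
    cl_device_id *devices =
        (cl_device_id *)malloc(numDevices * sizeof(cl_device_id));
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);

    printf("Platform has %u device(s)\n", numDevices);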



With this code, we can now retrieve identifiers for all platforms, and for the devices associated with a given platform (whichever one the platform variable is set to), via clGetDeviceIDs. To build this code, place it in a .cpp file and compile using:
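Something like the following (the file name platforms.cpp is just a placeholder for whatever you called the file):

    g++ platforms.cpp -o platforms \
        -I"$AMDAPPSDKROOT/include" -L"$AMDAPPSDKROOT/lib/x86_64" -lOpenCL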



where AMDAPPSDKROOT is the top-level directory of the OpenCL SDK installation. x86_64 may need to be changed to x86 depending on your platform.

Init()

I'll be trying to use this blog to catalogue experiences in computing. At the moment, I'm departing from the well-trodden paths of CUDA and starting to venture into OpenCL. I found a few tutorials online, so I hope to use this blog (at least initially) to add to that literature, hopefully ease someone else's passage into OpenCL, and help track my own progress.

There may be many similarities between these notes and those of other tutorials, but I have also found significant differences between my OpenCL installation and what other tutorials suggest is possible, possibly just as a result of different code versions. Currently, the OpenCL standard is at 1.1, which specifies the library's contents, while AMD's implementation of OpenCL is at revision 2.5 (though the machine I am using has 2.4 installed). I hope that I can also add information of my own.

Note that I will be approaching OpenCL as an alternative to CUDA for GPUs, so I probably won't get into programming multi-core CPUs or other architectures. To start with, here are a few tutorials that I found helpful:

Pretty detailed tutorial, a few things I haven't liked so far about how the platforms are set up but definitely very helpful

Brief introduction, only uses CPUs I think

Huge collection of tutorials from an AMD conference

OpenCL Reference Pages

The AMD conference link contains a number of PDF and video tutorials under the Sessions tab. If you're just getting started, I would absolutely recommend tutorial 1001: "Introduction to OpenCL". It's a brief (~40 minutes) but pretty good introduction to OpenCL, particularly understandable if you already have some background in GPU computing, but I think it's pretty straightforward otherwise as well. I should also add that some of these tutorials have function calls that conflict with each other and with the AMD SDK samples. I'm not sure yet what the proper way to do things is (i.e., how to set up OpenCL platforms and devices), but I'll let you know as soon as I do.

To get started, this is what that video tutorial taught me (I'll try to draw analogies to CUDA wherever I can, at least as I understand it).

An AMD GPU is composed of SIMD units, each of which has some number of processors; this is analogous to CUDA's streaming multiprocessors. I'm not sure yet how the number of execution units compares between AMD and NVIDIA GPUs, though I have a vague recollection that AMD uses fewer cores (don't hold me to that).

The finest-grained element of the OpenCL programming model is called a work item. A work item is executed by a single thread, and contains a series of instructions to execute (the kernel). It can access global memory on the GPU device, as well as a piece of what the tutorial refers to as local memory (CUDA: __shared__ memory). Work items are grouped together into work groups. A work group ensures that all of the work items it contains execute on the same SIMD unit on a device, and work items in the same group (i.e. on the same SIMD) can all access the same local memory but cannot access the local memory of another SIMD/work group. Barriers are possible within a work group (CUDA: __syncthreads()) but not between work groups (though the tutorial hinted at atomics, just like CUDA).
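To make that concrete, here is a hypothetical kernel of my own (not from the tutorial) showing work items sharing local memory within a work group, with a barrier in between:

    // Assumes a work group size of at most 64 (the size of the tile array).
    __kernel void neighbor_sum(__global float *in, __global float *out) {
        __local float tile[64];            // local memory (CUDA: __shared__)
        size_t gid = get_global_id(0);     // global index of this work item
        size_t lid = get_local_id(0);      // index within the work group

        tile[lid] = in[gid];               // each work item loads one element
        barrier(CLK_LOCAL_MEM_FENCE);      // wait for the whole work group

        // Now it's safe to read a neighbor's element -- but only neighbors
        // inside this same work group.
        size_t next = (lid + 1) % get_local_size(0);
        out[gid] = tile[lid] + tile[next];
    }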

So, on the low-level side of things, each SIMD has N processors (I think the tutorial mentioned that 16 is a standard number). Obviously, this means you want your minimum work group size to be 16, but it should be more, to help overlap work groups and hide latency. However, there is an additional concept of a wavefront in OpenCL. It seems that if you have a work group with size > N (where N is the number of processors), then what a SIMD unit does is fetch the next instruction for this work group and run that instruction not just for N work items in the work group but for N*M, where M is some integer (the tutorial uses the example of N=16, M=4). This means that even though you only have 16 processors in a SIMD, you sort of get 64 work items executing in lockstep. I imagine this might also decrease the cost of fetching instructions, as you fetch once for a greater number of work items. The tutorial doesn't explain why this concept of a wavefront is added, but it seems equivalent to a warp of threads in CUDA, even though the work items technically aren't executing in lockstep. To summarize: even though you have N processors per SIMD, OpenCL will run N*M work items at once by repeating an instruction M times across N work items.

Looking at OpenCL from a higher perspective, it has this concept of platforms and devices. A platform is "the host plus a collection of devices managed by the OpenCL framework", so this could be your CPU plus a number of GPUs (devices). Specifying a platform and device controls exactly which piece of hardware a command is issued to.
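To make the distinction concrete, here's a small sketch of my own that prints the name of the first platform and of its first device:

    #include <CL/cl.h>
    #include <stdio.h>

    int main() {
        cl_platform_id platform;
        cl_device_id device;
        char name[256];

        // The platform: the OpenCL framework managing a set of devices.
        clGetPlatformIDs(1, &platform, NULL);
        clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("Platform: %s\n", name);

        // A device on that platform: the actual piece of hardware.
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("Device:   %s\n", name);
        return 0;
    }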

Commands (i.e. copying data, launching computation, etc.) are issued to OpenCL from the host via command queues (CUDA: streams). Each command queue is associated with a single OpenCL device. From the tutorial, queues can be in-order or out-of-order: they either force commands to execute in the order they were placed in the queue, or they don't. OpenCL also provides synchronization mechanisms on the host side, so that even with an out-of-order command queue you can be certain that all previously issued commands have completed.
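For example, here is a sketch of an out-of-order queue plus a host-side wait, assuming a context and device like those created in the post above:

    cl_int status;
    cl_command_queue q = clCreateCommandQueue(
        context, device,
        CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,  // omit for an in-order queue
        &status);

    // ... enqueue copies and kernel launches here ...

    clFinish(q);  // blocks the host until every command in the queue is done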

I think that's enough information for a quick overview post. I would definitely recommend taking a look at the "Introduction to OpenCL" tutorial video, it was very helpful in going from nothing to some understanding of OpenCL.