Monday, April 30, 2012

The Connexions Importer Project

This past semester I had the chance to work with Kathi Fletcher on the Connexions Importer. Kathi’s work with the Importer, which is supported by the Shuttleworth Foundation, focuses on building an easy-to-use content importer for Connexions (an open repository of educational material at cnx.org) that enables content creators (teachers, professors, writers, etc.) to upload documents in a variety of formats into Connexions. This removes barriers to the worldwide distribution of free educational resources. As a member of Kathi’s international team, I had several responsibilities throughout the semester.

One main area I worked in was bug fixes. This initially served as a good way to become familiar with the code base being developed, while still contributing to the stability of the project. As I became more familiar with the code, I was able to take on more complex bug fixes. Among the issues I worked on this semester were:

  1. Unicode Errors (92+93): There were a couple of problems with the way the Importer was handling Unicode characters. One bug I worked on was a matter of incorrect mapping of some special, reserved characters. The Importer was converting Unicode to a different format, but the map it was using to do the conversion was wrong. I also found that a piece of code attempting to enforce text wrapping was introducing errors in a Unicode stream by placing non-Unicode characters within it. I discovered that the requirement for this wrapping no longer existed, and was able to remove it cleanly.
  2. Retained Metadata (55): One of my first assignments, this issue was concerned with the metadata retained on the client machine after a failed import. Previously, some information was left from one failed import, and could appear on the next one. This metadata was inaccurate, and could lead to incorrect naming of documents. It was necessary to track down the correct place to clear the dictionary which stored this metadata in Pyramid, to ensure there weren’t any unintended side effects.
  3. Invalid File Uploads (59): This issue occurred when a user attempted to upload a file of a format which the Importer does not support. While this issue was never truly resolved, I did add a more helpful error message for when this case was detected. We discussed some more advanced detection options, including analyzing the size and raw contents of a file, but there was some concern about the overhead this could incur.
  4. Spaces in Passwords (122): This was a small bug fixed during the sprint which took place following the Connexions Conference. Basically, cnx.org allowed passwords which contained spaces while the Importer’s validation system did not. Since we were essentially using Connexions for validation of users, this meant that a subset of the users who could use cnx.org would not be able to use the Importer. I relaxed the rules of the password filter so that spaces could be used.
  5. Missing Images (137): While this issue has yet to be resolved and is currently being worked on by Marvin, I was able to provide helpful information in identifying the source of the problem. It seems that some images (PNGs if I remember correctly) are allowed to be edited in Google Docs, while others are static content (JPEGs for instance). Editable images are stored at a different URL, and apparently have different permissions placed upon them. This presents a problem in that the Importer was not able to retrieve some images because of the different permission levels required, leading to missing images in the generated CNXML.
More details on the bugs above can be found at code.google.com/p/oer-roadmap/issues/list.

I also helped with some manual testing of Google Documents. The usual manual testing strategy for people working with the Importer is to manually split Google Documents into subsections, placing each subsection in a separate, new Google Doc. These subsections are then each run through the Importer individually. The goal here is to isolate errors down to the most basic element that can cause them. This process can be extremely tedious for long documents. As a result, I built a Python script (called splitter) that automatically generates new documents along subsections to help speed up the manual testing process. The code for this utility can be found at https://github.com/oerpub/oerpub.rhaptoslabs.swordpushweb/tree/master/oerpub/rhaptoslabs/swordpushweb/splitter.

By far the largest project of the semester was my automated testing framework. The problem being solved here is the introduction of bugs in CNXML generation. By building a testing framework that can be run automatically, periodically or based on triggers, you can proactively detect newly introduced bugs in the CNXML pipeline and alert developers to their mistakes. This is not a trivial problem because of the scope of the Importer: the framework needed to robustly handle processing of DOC, ODT, HTML, LaTeX, and GDoc files, with detection of errors in each format.

The system was designed around a core of ODT documents (which is easily expandable at any time). These ODT documents can each be converted to equivalent documents in any of the other formats, enabling a rapid expansion from a small set of ODT documents into a larger testing suite covering all of the formats using a variety of open source tools. I have also included a Python script which uses the Connexions pipeline to generate valid versions of each document for each format. The only manual work required in the system is checking each of these generated valid documents to ensure that it is in fact valid. This only needs to be done once per document per format, and those valid files will only be deleted if the generation script is explicitly told to do so.

The main script for actual testing runs each document in each format through the Connexions pipeline and compares them all against the valid versions of each file which are already available. If the comparison does not validate, a helpful error message is printed and a log file is output containing details on the exact differences between the newly created output and the valid output on hand. To make this comparison possible, a small XSLT script is also run across all generated documents which removes randomly assigned identifiers from elements in the generated XML that can cause false positives when comparing two output CNXML files.
The code for this testing framework can be found at https://github.com/oerpub/oerpub.rhaptoslabs.swordpushweb/tree/master/oerpub/rhaptoslabs/swordpushweb/test.

And that’s what I did this semester! At least everything I remember.

Wednesday, October 19, 2011

Contexts and Command Queues

I briefly mentioned command queues in the first post, but I don't think I said anything about contexts (mostly because I don't think I truly understood them).

OpenCL contexts seem to be a way to group together a host system with devices in such a way as to enable giving commands to them (basically the programmer saying "these are the pieces of hardware which I will be using through this particular command queue"). I say this because, when constructing a command queue (refer to the first post for some resources on what a command queue is as well as a brief description), the constructor must be passed a context. In order to create a context we use the clCreateContext function, which basically associates a collection of devices in a certain platform with a single OpenCL context. Then, we can create a command queue which applies to a single device in that context using the clCreateCommandQueue function.

Inserting the following code after the code from the previous post results in full initialization of a context and command queue:
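The original listing is gone, but it would have looked something along these lines, assuming `device` holds the cl_device_id selected by the discovery code from the previous post (variable names here are illustrative):

```c
cl_int err;

/* Group the chosen device into a context: "this is the hardware
   I will be issuing commands to." */
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
if (err != CL_SUCCESS) {
    fprintf(stderr, "clCreateContext failed: %d\n", err);
    exit(1);
}

/* Attach a command queue to that single device within the context. */
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);
if (err != CL_SUCCESS) {
    fprintf(stderr, "clCreateCommandQueue failed: %d\n", err);
    exit(1);
}
```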



This code creates a single OpenCL context associated with a single OpenCL device, and then attaches a command queue to it. I haven't tried this yet, but I imagine if you passed a device to clCreateCommandQueue which was not associated with the context passed, some error would result.

Discovering Platforms and Devices

In OpenCL, all work must be associated with a platform+device. This is a little different from CUDA in that with CUDA, you are handed a default device, and only if you go around playing with cudaSetDevice or the device contexts with the driver API can you get access to more than that one device. OpenCL, on the other hand, adds some burden up front in initializing the devices in exchange for the programmer having more control over what platforms/devices are being used and probably more readable+safe code in that everything is stated explicitly.

The two functions that I have found which seem to be most useful in doing this are clGetPlatformIDs and clGetDeviceIDs. These functions allow you to retrieve identifiers for platforms and devices, as well as the number of platforms/devices available. To illustrate this, let's start with some sample code I just wrote:
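The original sample is lost, but the standard two-call pattern it described looks roughly like this (names such as `num_platforms` and `platforms` are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void) {
    cl_uint num_platforms;

    /* First call: ask only for the number of available platforms. */
    if (clGetPlatformIDs(0, NULL, &num_platforms) != CL_SUCCESS) {
        fprintf(stderr, "clGetPlatformIDs (count) failed\n");
        return 1;
    }
    printf("Found %u OpenCL platform(s)\n", num_platforms);

    /* Second call: fetch an identifier for each platform. */
    cl_platform_id *platforms = malloc(num_platforms * sizeof(cl_platform_id));
    if (clGetPlatformIDs(num_platforms, platforms, NULL) != CL_SUCCESS) {
        fprintf(stderr, "clGetPlatformIDs (fetch) failed\n");
        return 1;
    }

    free(platforms);
    return 0;
}
```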




This code discovers how many platforms there are, and then fetches platform identifiers for each platform into the platforms array (clGetPlatformIDs). A similar process can be done for devices on a single platform:
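Again the original listing is gone; a sketch of the same count-then-fetch pattern for devices, assuming `platform` is a cl_platform_id obtained from clGetPlatformIDs:

```c
cl_uint num_devices;

/* First call: count the devices of any type on this platform. */
clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);

/* Second call: fetch an identifier for each device. */
cl_device_id *devices = malloc(num_devices * sizeof(cl_device_id));
clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);
```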



With this code, we can now retrieve information on all platforms, and the devices associated with those platforms (depending on what platform is set to) (clGetDeviceIDs). To build this code, place it in a .cpp file and compile using:
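The exact command from the original post is lost; a build command consistent with the description below would be roughly (the file name devices.cpp is hypothetical):

```shell
g++ devices.cpp -o devices \
    -I"$AMDAPPSDKROOT/include" \
    -L"$AMDAPPSDKROOT/lib/x86_64" \
    -lOpenCL
```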



where AMDAPPSDKROOT is the top directory where the OpenCL SDK is. x86_64 may need to be changed to x86 depending on your platform.

Init()

I'll be trying to use this blog to catalogue experiences in computing. At the moment, I'm departing from the well-trodden paths of CUDA and starting to adventure into OpenCL. I found a few tutorials online, so I hope to use this blog (at least initially) to add to that literature, and hopefully ease someone else's passage into OpenCL as well as help track my own progress. There may be many similarities between these notes and those of other tutorials, but I have also found significant differences between my OpenCL installation and what other tutorials suggest is possible, possibly just as a result of different code versions. Currently, the OpenCL standard is at 1.1, which specifies the library's contents, while AMD's implementation of OpenCL is at revision 2.5 (though the machine I am using has 2.4 installed).

Note that I will be approaching OpenCL as an alternative to CUDA for GPUs, so I probably won't get into programming multi-core CPUs or other architectures. To start with, here are a few tutorials that I found helpful:

Pretty detailed tutorial, a few things I haven't liked so far about how the platforms are set up but definitely very helpful

Brief introduction, only uses CPUs I think

Huge collection of tutorials from an AMD conference

OpenCL Reference Pages

That last link contains a number of PDF and video tutorials under the Sessions tab. If you're just getting started, I would absolutely recommend tutorial 1001: "Introduction to OpenCL". It's a brief (~40 minutes) but pretty good introduction to OpenCL, particularly understandable if you already have some background in GPU computing, but I think pretty straightforward otherwise as well. I should also add that some of these tutorials have function calls that conflict with each other and with the AMD SDK samples. I'm not sure yet what the proper way to do things is (i.e. how to set up OpenCL platforms and devices), but I'll let you know as soon as I do.

To get started, this is what that video tutorial taught me (I'll try to draw analogies to CUDA wherever I can, at least as I understand it).

An AMD GPU is composed of SIMD units, each of which has n processors. This is analogous to CUDA's streaming multiprocessors. I'm not sure yet how the number of execution units compares between AMD and NVIDIA GPUs, though I have a vague recollection that AMD uses fewer cores (don't hold me to that).

The finest element of the OpenCL programming model is called a work item. A work item is executed by a single thread, and contains a series of instructions to execute (the kernel). It can access global memory on the GPU device, as well as a piece of what the tutorial refers to as local memory (CUDA: __shared__ memory). Work items are grouped together into work groups. A work group ensures that all items contained execute on the same SIMD on a device, and work items in the same group (i.e. on the same SIMD) can all access the same local memory but cannot access the local memory of another SIMD/work group. Barriers are possible within a work group (CUDA: __syncthreads()) but not between work groups (though the tutorial hinted at atomics, just like CUDA).
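To make that grouping concrete, here is a tiny hypothetical kernel (names are mine, not from the tutorial) showing the pieces just described: a per-item id, local memory shared within a work group, and a group-wide barrier:

```c
// Hypothetical kernel: each work item doubles one element, staging it
// through local memory shared by its work group.
__kernel void double_elements(__global float *data, __local float *tile) {
    size_t gid = get_global_id(0);  // unique across all work items
    size_t lid = get_local_id(0);   // unique within this work group

    tile[lid] = data[gid];          // local memory (CUDA: __shared__)
    barrier(CLK_LOCAL_MEM_FENCE);   // CUDA: __syncthreads(); work-group scope only

    data[gid] = tile[lid] * 2.0f;
}
```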

So, on the low-level side of things, each SIMD has N processors (I think the tutorial mentioned that 16 is a standard number). Obviously, this means you want your minimum work group size to be 16, but it should be more to help overlap work groups and hide latency. However, there is an additional concept of a wavefront in OpenCL. It seems that if you have a work group with size > N (where N is the number of processors), then what a SIMD unit will do is fetch the next instruction for this work group, and run that instruction not just for N work items in the work group but for N*M, where M is some integer (the tutorial uses the example of N=16, M=4). This means that even though you only have 16 processors in a SIMD, you sort of get 64 work items executing in lockstep. I imagine this might also decrease the cost of fetching instructions, as you fetch once for a greater number of work items. The tutorial doesn't explain why this concept of a wavefront is added, but it seems equivalent to a warp of threads in CUDA, even though the work items technically aren't all executing in the same cycle. To summarize, even though you have N processors per SIMD, OpenCL will run N*M work items at once by repeating an instruction M times for N work items.

Looking at OpenCL from a higher perspective, it has this concept of platforms and devices. A platform is "The host plus a collection of devices managed by the OpenCL framework", so this could be your CPU plus a number of GPUs (devices). Using platforms+devices controls exactly which piece of hardware a command is being issued to.

Commands (i.e. copying data, launching computation, etc.) are issued to OpenCL from the host via command queues (CUDA: streams). These command queues are associated with an OpenCL device. From the tutorial, queues can be in-order or out-of-order, meaning that they can force commands to execute in the order you placed them in the queue, or not. They also provide synchronization mechanisms from the host, so that even with an out-of-order command queue you can be certain that all of the previously issued commands have completed.
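As a sketch of the two modes (assuming a `context` and `device` already set up as earlier in this post):

```c
cl_int err;

/* Out-of-order queue: commands may execute in any order unless events
   impose one. Pass 0 as the properties argument for an in-order queue. */
cl_command_queue queue = clCreateCommandQueue(context, device,
        CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

/* ... enqueue copies and kernel launches here ... */

clFinish(queue);  /* host blocks until every enqueued command completes */
```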

I think that's enough information for a quick overview post. I would definitely recommend taking a look at the "Introduction to OpenCL" tutorial video, it was very helpful in going from nothing to some understanding of OpenCL.