Monday, April 30, 2012

The Connexions Importer Project

This past semester I had the chance to work with Kathi Fletcher on the Connexions Importer. Kathi’s work on the Importer, which is supported by the Shuttleworth Foundation, focuses on building an easy-to-use content importer for Connexions, an open repository of educational material. The Importer enables content creators (teachers, professors, writers, etc.) to upload documents in a variety of formats into Connexions. This removes barriers to the worldwide distribution of free resources. As a member of Kathi’s international team, I had several responsibilities throughout the semester.

One main area I worked in was bug fixes. This initially served as a good way to become familiar with the code base being developed, while still contributing to the stability of the project. As I became more familiar with the code, I was able to take on more complex bug fixes. Among the issues I worked on this semester were:

  1. Unicode Errors (92+93): There were a couple of problems with the way the Importer was handling Unicode characters. One bug I worked on was an incorrect mapping of some special, reserved characters. The Importer was converting Unicode to a different format, but the map it used for the conversion was wrong. I also found that a piece of code attempting to enforce text wrapping was introducing errors into a Unicode stream by inserting non-Unicode characters. I discovered that the requirement for this wrapping no longer existed, and was able to remove the code cleanly.
  2. Retained Metadata (55): One of my first assignments, this issue was concerned with the metadata retained on the client machine after a failed import. Previously, some information was left from one failed import, and could appear on the next one. This metadata was inaccurate, and could lead to incorrect naming of documents. It was necessary to track down the correct place to clear the dictionary which stored this metadata in Pyramid, to ensure there weren’t any unintended side effects.
  3. Invalid File Uploads (59): This issue occurred when a user attempted to upload a file in a format which the Importer does not support. While this issue was never truly resolved, I did add a more helpful error message for when this case is detected. We discussed some more advanced detection options, including analyzing the size and raw contents of a file, but there was some concern about the overhead this could incur.
  4. Spaces in Passwords (122): This was a small bug fixed during the sprint which took place following the Connexions Conference. Basically, Connexions allowed passwords which contained spaces while the Importer’s validation system did not. Since we were essentially using Connexions to validate users, this meant that a subset of the users who could use Connexions would not be able to use the Importer. I relaxed the rules of the password filter so that spaces could be used.
  5. Missing Images (137): While this issue has yet to be resolved and is currently being worked on by Marvin, I was able to provide helpful information in identifying the source of the problem. It seems that some images (PNGs, if I remember correctly) can be edited in Google Docs, while others are static content (JPEGs, for instance). Editable images are stored at a different URL, and apparently have different permissions placed upon them. As a result, the Importer was not able to retrieve some images because of the different permission levels required, leading to missing images in the generated CNXML.
More details on the bugs above can be found at
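
To illustrate the retained metadata issue (55), here is a minimal sketch of the idea behind the fix: reset the metadata dictionary at the start of every import so nothing from an earlier, possibly failed, attempt can leak into the next one. This is not the Importer's actual code. The `session` argument is a plain dict standing in for the Pyramid session, and `METADATA_KEY` and `begin_import` are hypothetical names.

```python
METADATA_KEY = "import_metadata"  # hypothetical session key, not the Importer's real one


def begin_import(session, filename, title=None):
    """Start a new import with a clean metadata dictionary.

    `session` is assumed to behave like a dict (as a Pyramid session does).
    """
    # Reset first, so leftovers from a previous (failed) import cannot survive.
    session[METADATA_KEY] = {}
    metadata = session[METADATA_KEY]
    metadata["filename"] = filename
    if title is not None:
        metadata["title"] = title
    return metadata


# Usage: the first import records a title and then (we pretend) fails.
session = {}
begin_import(session, "chapter1.doc", title="Chapter 1")
# The second import has no title of its own and must not inherit the old one.
second = begin_import(session, "notes.odt")
print(second)
```

Without the reset, the second import would still see the title left behind by the first, which is the kind of incorrect naming described above.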

I also helped with some manual testing of Google Documents. The usual manual testing strategy for people working with the Importer is to split Google Documents into subsections by hand, placing each subsection in a separate, new Google Doc. Each subsection is then run through the Importer individually. The goal is to isolate errors down to the most basic element that can cause them. This process can be extremely tedious for long documents. As a result, I built a Python script (called splitter) that automatically generates new documents along subsection boundaries to help speed up the manual testing process. The code for this utility can be found at
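
The core idea of splitting a document along its subsections can be sketched in a few lines. This is only an illustration, not the splitter script itself: it works on plain text lines rather than Google Docs, and it assumes, purely for the example, that a heading is any line starting with `# `. The function name `split_sections` is hypothetical.

```python
def split_sections(lines, heading_prefix="# "):
    """Group lines into (title, body_lines) pairs, one pair per heading.

    Text that appears before the first heading is kept under "preamble".
    The heading convention is an assumption made for this sketch.
    """
    sections = []
    title, body = "preamble", []
    for line in lines:
        if line.startswith(heading_prefix):
            # Close the section collected so far (skip an empty preamble).
            if body or title != "preamble":
                sections.append((title, body))
            title, body = line[len(heading_prefix):].strip(), []
        else:
            body.append(line)
    # Close the final section.
    if body or title != "preamble":
        sections.append((title, body))
    return sections


# Usage: each resulting section could be saved as its own small test document.
doc = ["Intro text", "# Equations", "E = mc^2", "# Images", "A figure goes here"]
for section_title, section_body in split_sections(doc):
    print(section_title, section_body)
```

Each piece can then be run through the Importer on its own, which narrows a failure down to the subsection that triggers it.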

By far the largest project of the semester was my automated testing framework. The problem it addresses is the introduction of bugs in CNXML generation. A testing framework that runs automatically, either periodically or in response to triggers, can proactively detect newly introduced bugs in the CNXML pipeline and alert developers to their mistakes. This is not a trivial problem because of the scope of the Importer: the framework needed to robustly handle DOC, ODT, HTML, LaTeX, and GDoc files and detect errors in each format.

The system is designed around a core set of ODT documents, which can easily be expanded at any time. Each ODT document can be converted to an equivalent document in any of the other formats using a variety of open source tools, so a small set of ODT documents quickly grows into a larger test suite covering all of the formats. I have also included a Python script which uses the Connexions pipeline to generate valid versions of each document for each format. The only manual work required is checking each of these generated documents to ensure that it is in fact valid. This only needs to be done once per document per format, and the valid files are only deleted if the generation script is explicitly told to do so.

The main testing script runs each document in each format through the Connexions pipeline and compares the output against the valid version already on hand. If the comparison fails, a helpful error message is printed and a log file is written detailing the exact differences between the newly created output and the valid output. To make this comparison possible, a small XSLT script is first run across all generated documents. It removes the randomly assigned identifiers from elements in the generated XML, which would otherwise cause false positives when comparing two CNXML output files.
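
The comparison step can be sketched roughly as follows. This is an illustration under stated assumptions, not the framework's code: it strips identifiers with Python's `xml.etree.ElementTree` instead of the XSLT script described above, it assumes the random identifiers live in an attribute named `id`, and the function names and sample XML are made up for the example.

```python
import difflib
import xml.etree.ElementTree as ET


def strip_ids(xml_text, attr="id"):
    """Return the XML with every `attr` attribute removed.

    Assumes the randomly assigned identifiers are stored in `attr`.
    """
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.attrib.pop(attr, None)
    return ET.tostring(root, encoding="unicode")


def compare_output(generated, valid):
    """Return unified-diff lines between the two outputs; empty means a match."""
    valid_lines = strip_ids(valid).splitlines()
    generated_lines = strip_ids(generated).splitlines()
    return list(difflib.unified_diff(
        valid_lines, generated_lines,
        fromfile="valid", tofile="generated", lineterm=""))


# Usage: only the identifiers differ, so the comparison reports no differences.
valid = '<document><para id="a1">Hello</para></document>'
generated = '<document><para id="zz9">Hello</para></document>'
diff = compare_output(generated, valid)
print("match" if not diff else "\n".join(diff))
```

When the diff is not empty, its lines are the kind of detail that would go into the log file.
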
The code for this testing framework can be found at

And that’s what I did this semester! At least everything I remember.
