Monday 18 August 2014

Binary Search Algorithm: Applied in real life

When I announced the availability of MapGuide Open Source 2.6, we did not have a CentOS build available due to a serious blocking issue where iconv APIs would mysteriously fail within MapGuide.

The main symptoms of this iconv failure are:
  • Under default build settings, mgserver will fail on startup with a "could not load a transcoding service" error from the xerces library. The default build settings use iconv APIs to transcode strings within the xerces library.
  • The SHP FDO provider will throw a "memory allocation failed" error when attempting to connect to any SHP feature sources. The SHP provider also uses iconv APIs for narrow <-> wide string conversions.
While the first problem could be worked around (by building xerces to use a different transcoder), the second one was truly back-breaking. I didn't want to release 2.6 for CentOS with support for the most ubiquitous spatial data format broken out of the box.

As a first priority after the 2.6 release, I set out to find when this issue first cropped up and which component was to blame.

The 2.5.2 release for CentOS did not have this issue, so to eliminate FDO as the offending component, I built the 2.5.2 release against the FDO 3.9 branch. Fortunately, the iconv failure did not show up, so we now knew that FDO was not the culprit. It was going to be either MapGuide itself or one of its Oem components that had caused this breakage.

Knowing that FDO was not the culprit, it was time to start identifying the svn revision in MapGuide that brought us this mess. Unfortunately, there had been quite a lot of revisions between 2.5.2 and 2.6, and knowing how long it takes to build and verify a single revision of MapGuide (because ... C++ code), it would have been painstaking to build and verify every single revision.

So, as a logical first shot at reducing the set of svn revisions to check, I picked the very first working revision of the 2.6 branch to see if the issue existed there. It didn't (yay!), meaning our problem space was now reduced to 60 commits in the 2.6 branch.

One of these 60 revisions broke the CentOS build. Rather than wasting time building and verifying 60 individual revisions to identify the breaking revision, I took a more systematic approach and picked the closest "mid-point" revision that affected files in the Server/Oem/Common/Web directories.

If the problem showed up there, we could narrow the problem space to that revision and the ones older than it (i.e. the tested revision or an earlier one introduced the problem); otherwise, we could narrow it to the revisions newer than the one being tested (i.e. some newer revision introduced the problem). Repeat this process, halving the problem space each time, until it shrinks to a single failing revision. That revision is the one that broke our CentOS build.
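
To make the process concrete, here's a minimal Python sketch of that bisection loop. The `find_breaking_revision` function and the `is_broken` callback are hypothetical stand-ins; in practice, each "test" was a full manual build-and-verify cycle of that revision.

```python
def find_breaking_revision(revisions, is_broken):
    """Bisect a chronologically ordered list of revisions.

    Assumes revisions[0] is known-good, revisions[-1] is known-broken,
    and is_broken(rev) builds and verifies that revision.
    """
    lo, hi = 0, len(revisions) - 1  # revisions[lo] good, revisions[hi] broken
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(revisions[mid]):
            hi = mid  # breakage was introduced at mid or earlier
        else:
            lo = mid  # breakage was introduced after mid
    return revisions[hi]  # the first revision whose build fails
```

The payoff is logarithmic: with 60 candidate revisions, that's at most around six build-and-verify cycles instead of sixty.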

As the Trello card I used to track this problem can attest, finding the offending revision was pretty quick.


And if the title of this post didn't give it away, this systematic process has a name: it's called the Binary Search Algorithm.

Though in our case it wasn't a true binary search; it was more of a "biased" binary search. Although our problem space was 60 revisions, some of those revisions did not touch any part of the MapGuide Server/Oem/Web/Common code base, so they could be excluded from the problem set up front. Also, when the candidate revision to test landed beside a "big merge" revision, we tested that "big merge" revision as well, just so we could immediately rule it out.
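
To illustrate that pruning step, here's a hypothetical Python sketch that filters `svn log` output down to the revisions touching those directories. The repository URL, revision range, and path patterns are assumptions for illustration; this isn't a script that was actually run at the time.

```python
import subprocess
import xml.etree.ElementTree as ET

# Directories whose changes could plausibly have caused the breakage.
RELEVANT = ("/Server/", "/Oem/", "/Common/", "/Web/")

def candidate_revisions(repo_url, start_rev, end_rev):
    """Return the revisions in [start_rev, end_rev] that touch relevant paths."""
    xml_log = subprocess.check_output(
        ["svn", "log", "--xml", "--verbose",
         "-r", f"{start_rev}:{end_rev}", repo_url]
    )
    revs = []
    for entry in ET.fromstring(xml_log).iter("logentry"):
        changed_paths = [p.text or "" for p in entry.iter("path")]
        if any(d in path for path in changed_paths for d in RELEVANT):
            revs.append(int(entry.get("revision")))
    return sorted(revs)
```

Feeding only these candidates into the bisection loop above shrinks the problem space before a single build is kicked off.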

So there you have it. Knowing Binary Search is not just for passing your Computer Science exam, acing that Software Developer job interview, or implementing various data structures; it has real-life applications too, like hunting down which commit broke your build.

Some might say: wouldn't a Continuous Integration system have caught this? Indeed it would have. But as I've talked about previously regarding how we make our builds of MapGuide, the Windows builds are built under Jenkins (which can detect and flag broken/unstable builds thanks to its rich ecosystem of plugins), while the Linux builds are not. The Linux builds, although mostly automated now, still require manual invocation (to start the vagrant provisioning process) and manual review of the various log files produced to see if anything broke. The offending revision obviously slipped under the radar.

The Linux build system clearly has much more room for improvement, something we now have the opportunity to explore with 2.6 out the door.

But before we go about that, let's put out that overdue 2.6 build for CentOS.
