Thursday, April 27, 2006

[geek] The bugs, they are subtle...

So at work today, I was greeted by an email containing possibly disastrous news: our software (the version rushed out for the agricultural study, remember?) looked like it might be eating data. After seeing it go live, the client had some (quite reasonable) changes they wanted implemented and gave us a week and a half to get them done. Most of them were content-related, not logic, so it wasn't a big deal to implement and test the changes, then turn the app around for another go in the field. So they schlepped out to East Buttcrack, WA, app in hand, for another round of (we mutually agreed) pilot data collection.

The principal investigator (or one of his minions) then had the good sense to actually import the data into a stats package they were familiar with and take a preliminary look to see if they had everything they expected - for instance, did the subject IDs in the database match up with those in the hand-written log books? (IMO, it's always a Good Thing when the client is savvy enough to tell whether what they're getting is crap or not. They know what they should be getting, while we're pretty much guessing. We try to make as informed a guess as possible, but we aren't the domain experts in what we're trying to model in the software. But I digress...) The client threw up a red flag: one of the interviews they conducted was missing from the database. Gone. Poof. Vanished into the digital aether. That's not good. That's really not good.

Still, for debugging a client-reported defect, we had a lot more data than we typically start with. Again, this is a Good Thing. They brought in the offending machine, and we went to work on it. Sure enough, there in the event log was error after error: SQL transaction failed, again and again and again. We were logging the exceptions, but we weren't alerting the user to them (a design decision I personally disagreed with, but hey...).
We probably couldn't have told them to do anything more intelligent than exit the application and start over, but that probably would have fixed the problem. So off we go - what causes this error? Debuggers are fired up, process explorers are going... and they give us nothing. Nada. Bupkus. We can't reproduce the error. Oh, we found the cause of a similar exception that was being logged whenever the app exited (the code worked properly despite itself), but a wholesale failure of a given data-collection session? Nope. That suxors. Lots. And not in a good way.

Finally, my coworker (he's a dev, but he's also more of a jack-of-all-IT-trades kind of guy) is free-associating, and wonders if somehow threads were sharing data. No, I tell him - while you can do that with relative ease, we aren't doing anything like that with this app. All the components run inline, in the same thread. We spin off child threads for load balancing, but we don't exchange data between threads. We don't need to, it's a disaster waiting to happen, and, done poorly, it can quickly become the programming version of the Witch House, bending reality beyond anything that Should Be. But maybe he was on to something...

He mentioned that it was a fairly common occurrence for the field staff to launch more than one instance of the app. First, the app runs on a Tablet PC, and the 'double-click' with the stylus breaks the rules that Windows follows in a mouse-driven environment. Not much, but it's enough of a deviation that it takes a little while to get used to. Second, with .NET applications, there's almost always going to be lag the first time you run the app. As with Java, the app is compiled on the fly; the pieces that get accessed most often are then kept compiled for subsequent accesses. This improves performance eventually, but initially it can seem like nothing's happening. And what do users do when nothing's happening? They click again. Third, these tablets are running fairly puny mobile CPUs.
This increases the lag time, relative to the PCs upon which these apps are developed. And what does greater lag time mean? Yup, more clicking.

So, three or four instances of the app open results in each creating its own database record - but if the lag hits just right, more than one instance of the app will 'share' the record ID of the most recent record. The staffer, realizing that they've got more than one instance running, will toggle around and close out those 'extra' instances. The app, being fairly polite and well-intentioned, will clean up after itself and delete whatever unused database record it is holding on to. If the instance they leave running shares the record ID with one of the instances that got closed, the closing app sees that the database record hasn't been used yet and deletes it. The still-functioning app no longer has a valid record to write to. It's, well, gone. Poof. Vanished into the digital aether. Since we aren't throwing the resulting exception in the user's face, they proceed apace, blissfully unaware that none of their data is being stored.

And you can't reproduce the error if your machine is beefy enough. Can't. Won't happen. You can launch as many instances of the app as you want, and each and every one of them will have the correct record ID in memory. If our client hadn't been willing to turn the machine in to us, we wouldn't ever have found the bug. They would have continued to lose data, we would have continued to be unable to reproduce it, and things would have gone south from there. I lose sleep over bugs like this: intermittent and (apparently) unreproducible bugs are a nightmare, and our user base isn't broad enough (and our hardware base is too monocultural) for us to see any kind of hardware-related (or third-party software-related) patterns of errors.

In this case, the fix was easy - on startup, see if there's another instance running. If there is, quit quietly and gracefully and bring the other instance to the fore. If there isn't, launch away. But all too often, programming depends upon gut feelings and hunches to identify the sources of problems, and when the fixes aren't quite so easy, you run a tremendous risk of ending up with more problems than you started with. Today we were lucky. We won't always have that luxury.

4 Comments:

Blogger Brian Dunbar said...

We were logging the exceptions, but we weren't alerting the user to them (a design decision I personally disagreed with, but hey...)

I'm not your client, and I'm not an end user. But I support some fairly esoteric applications in a manufacturing environment.

Throw them exception alerts, baby. Toss them right in the user's face. Let them know things aren't right so they don't go merrily along entering data when nothing is really happening and then get cheesed off when they find out they've got to repeat the data entry...

Just my opinion of course.

4/28/2006 07:36:00 AM  
Blogger protected static said...

It's mine as well... When I release my component into the wild, if it has a publicly accessible API, someone somewhere might, you know, actually try to program with it. I have no idea how they're going to want to use my component, nor do I know if I've accounted for every possible boneheaded thing some other developer (or this developer, for that matter) might do.

I think that part of this is a bit of a culture clash - my coworkers come largely from a web development background (creating tools to be used by other programmers and technical end users), while I've mostly done custom database and desktop development for (non-technical) business environments.

4/28/2006 07:57:00 AM  
Blogger Stephen Spencer said...

Bravo on finding that esoteric bug, my friend.

4/30/2006 11:15:00 PM  
Blogger protected static said...

Woulda been better to have not introduced it in the first place... ;-)

4/30/2006 11:45:00 PM  
