Concurrent processing
It has been a while since I’ve written parallel or concurrent processing code. Threaded programming is something that even the experts who wrote Apache and PHP have problems with, yet writing this type of code is somewhat enjoyable.
It started a few years ago when I replaced 112k lines of C code and libraries. I had communicated the idea to the coders, and the coders wrote the design document, then delivered a product that never quite did what their own design document specified.
The code was scrapped and rewritten in Perl, and it now comprises about 1200 lines, not including the CPAN libraries used. The new code is faster, more reliable, and agnostic to the tasks it carries. It has more capabilities but leaves more of the work to the tasks that it passes around, which lets the code itself concentrate on communications and dispatch.
Initial testing was rather thorough, based on the bugs and issues encountered during the previous version’s reign. We’ve run into minor glitches with the new code, but it is considerably more reliable, to the point where it is tasked to do more. However, while the dispatch method was rewritten, concurrent task collision wasn’t tested nearly enough.
And therein lies the problem. The previous system accepted a task, opened a connection to the remote machine, and waited until the task completed. Collisions couldn’t occur because each task held its own connection for its entire run. For short tasks, this wasn’t a real issue, but longer tasks risked the socket timing out. If 15 tasks were sent, 15 connections remained open until the tasks completed.
The replacement system hands off the task but doesn’t wait for it to complete. The remote machine handles its packet and returns the task results. When multiple tasks are added for the same machine, a few collisions result. Task order isn’t important, but sometimes a task is fetched twice, or a task is missed and left in the queue. A task left in the queue is redispatched, but the double fetch issue has been difficult to debug. Put in the slightest amount of debugging code and voila, tasks are dispatched properly under every test that can be thrown at it. Remove the debugging code and the error returns, which is the usual signature of a timing-dependent race.
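To make the double fetch concrete, here is a minimal sketch of one common way to make a fetch exclusive: claiming a task by renaming its file out of the queue. This is only an illustration in Python, not the Perl code described here, and it assumes a file-per-task queue on a single POSIX filesystem. The names claim_task, queue_dir, and claimed_dir are hypothetical.

```python
import os
import tempfile


def claim_task(queue_dir, claimed_dir, task_name):
    """Try to claim a task by moving its file; return True if this caller won."""
    src = os.path.join(queue_dir, task_name)
    dst = os.path.join(claimed_dir, task_name)
    try:
        # rename is atomic on POSIX when both directories share a filesystem,
        # so only one caller can move the file out of the queue.
        os.rename(src, dst)
        return True
    except FileNotFoundError:
        # The file is already gone: another worker claimed the task first.
        return False


# Small demonstration with a throwaway queue (hypothetical layout).
root = tempfile.mkdtemp()
queue_dir = os.path.join(root, "queue")
claimed_dir = os.path.join(root, "claimed")
os.mkdir(queue_dir)
os.mkdir(claimed_dir)
open(os.path.join(queue_dir, "task-1"), "w").close()

first = claim_task(queue_dir, claimed_dir, "task-1")
second = claim_task(queue_dir, claimed_dir, "task-1")
print(first, second)
```

The point of the sketch is that the check and the claim are a single operation. A fetch that first looks for the task and then marks it as taken in a separate step leaves a window in which a second worker can see the same task.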
The task is left in the queue for processing, and the file locking for the state machine has been double and triple checked, but I’m sure that once I dig into it, I’ll find some logic error that leaves a stale lock or incorrectly clears a lock.
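As an illustration of the stale lock concern, the sketch below shows one way to scope a file lock so that it is released on every exit path, including errors. It is written in Python rather than Perl and assumes a POSIX system with advisory flock locks. The helper names state_lock and try_lock and the lock file path are hypothetical, not taken from the system described here.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def state_lock(path):
    """Hold an exclusive advisory lock on path for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield fd
    finally:
        # Runs on normal exit and on exceptions; closing the descriptor
        # releases the lock, so no exit path leaves it held.
        os.close(fd)


def try_lock(path):
    """Return True if the lock could be taken right now, without waiting."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False
    finally:
        os.close(fd)


lock_path = os.path.join(tempfile.mkdtemp(), "state.lock")

with state_lock(lock_path):
    held_during = try_lock(lock_path)   # a second taker is refused

try:
    with state_lock(lock_path):
        raise RuntimeError("simulated failure while holding the lock")
except RuntimeError:
    pass

free_after_error = try_lock(lock_path)  # lock was released despite the error
print(held_during, free_after_error)
```

Because the kernel drops an flock lock when its descriptor is closed or the process exits, this style cannot leave a lock behind after a crash. A scheme that signals the lock through the existence of a file, or through a flag that code must clear by hand, does not have that property.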
I remember I used to love writing code like this, though I always dreaded debugging it.