Batch Processing

https://www.flickr.com/photos/pargon/2444943158
Pargon (flickr.com)

I love learning new words, especially if they help to explain something in a way that I never thought of.  For instance, let’s take the following word:

idempotent: the property of certain operations in mathematics and computer science, that can be applied multiple times without changing the result beyond the initial application

Or, at least, that is what Wikipedia says.  But what does it really mean?  Suppose you had a function that updated someone’s birthdate and you called it with the value ‘1984-04-03’.  Afterward, that person’s birthdate would be ‘1984-04-03’.  No matter how many times you called the update function with that value, the result would still be the same.  Reading data is also an idempotent operation: as long as nothing else changes the data, the values are the same no matter how often you read them.
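To make the difference concrete, here is a minimal sketch in Python.  The in-memory `people` dictionary and the function names `set_birthdate` and `record_login` are made up for illustration; they are not from any real system.  One function sets a value outright (idempotent), while the other changes the value on every call (not idempotent).

```python
# Hypothetical in-memory "database" of people, used only for illustration.
people = {"alice": {"birthdate": None, "login_count": 0}}


def set_birthdate(person_id, birthdate):
    """Idempotent: sets the value outright, so repeating the call changes nothing."""
    people[person_id]["birthdate"] = birthdate


def record_login(person_id):
    """Not idempotent: every call changes the stored value again."""
    people[person_id]["login_count"] += 1


# Call both functions three times with the same arguments.
for _ in range(3):
    set_birthdate("alice", "1984-04-03")
    record_login("alice")

print(people["alice"])
# prints {'birthdate': '1984-04-03', 'login_count': 3}
```

The birthdate looks exactly as it would after a single call, but the login count shows every repeat.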

So why is this important?  Let’s say that you have a job, whether a Windows Service, a batch job, something in ActiveBatch, etc., and you have multiple servers upon which this job can run.  If it is a Windows Service running on multiple machines, what happens if it runs twice?  If it is idempotent, the second run changes nothing and there is no data that needs fixing.  If it is a job that runs periodically and picks up unprocessed records, what if two jobs are running simultaneously and each one processes the same record?  If it is idempotent, there is no issue.  What if one job fails after processing a record but before updating it to say that it was processed?  Idempotent = no problem.

There is a growing movement within the IT industry to ensure that “cron” jobs (scheduled jobs/batch jobs) are idempotent wherever possible.  So what impact will this have on systems?  For systems that only have a reporting function done in “batch”, there is virtually no impact.  For systems that process data in batch, you will need to examine whether this is truly what needs to be done.  Some things cannot or should not be made idempotent.  A payroll job that runs every two weeks?  No, it shouldn’t be idempotent; it needs to run once and only once.  Sending out an email to ten or twenty thousand people?  Preferably only once, but if it does run twice there is no data damage.  Perhaps some reputational damage, but no damage to the data or the system.
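For a job like the payroll run, which has to execute once and only once, one possible safeguard is a marker keyed on the pay period that refuses a second start.  The sketch below is only an illustration under that assumption: `run_payroll`, `AlreadyRunError`, the `completed_periods` set, and the sample pay period are all hypothetical, and a real job would keep the marker in durable storage shared by every server.

```python
class AlreadyRunError(Exception):
    """Raised when a run-once job is started again for the same period."""


# Hypothetical store of completed pay periods; a real job would persist this
# in a database or other durable storage visible to all servers.
completed_periods = set()
payments = []


def run_payroll(pay_period, employees):
    if pay_period in completed_periods:
        raise AlreadyRunError(f"payroll already ran for {pay_period}")
    # Record the marker first, so a second start is refused even while the
    # first run is still in progress. A failure part-way through would then
    # need manual follow-up rather than an automatic rerun.
    completed_periods.add(pay_period)
    for name, amount in employees:
        payments.append((pay_period, name, amount))


staff = [("alice", 2000), ("bob", 1800)]  # made-up sample data
run_payroll("2024-01-15", staff)

try:
    run_payroll("2024-01-15", staff)  # accidental second run
    second_run_refused = False
except AlreadyRunError:
    second_run_refused = True

print(len(payments), second_run_refused)
# prints 2 True
```

The trade-off in this sketch is deliberate: it prefers refusing a rerun over risking a double payment.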

Batch is a remnant of the 1980s.  Current batch processing consists of two very distinct components:  processing and reporting.  Batch reporting is idempotent, whereas batch processing often is not.  But the real question is whether the batch processing is a genuine requirement of the system or a leftover from mainframe processing.  Based on how many organizations do not incorporate a batch processing element into their systems, I would have to say that many of our batch processing requirements are due to a lack of understanding of the business or of the capabilities of the technology.  Either way, it can be fixed.
