Service Level Agreement

My apologies in advance if I seem a little crabby/cranky/ornery this past week but I’ve been Yvonne’s backup, I’ve been sick and my darn sump pump stopped working just long enough to soak the carpet.  (Wet carpet sure weighs a lot.)  When our sump pump went we called up a plumber based on a friend’s recommendation.  He showed up exactly when he told my wife he would.  He advertised a service and lived up to that agreement.  He essentially met his SLA (Service Level Agreement).

Regardless of what we might believe, we all have certain SLAs as part of our role in an organization, our role on a project, our role as husband/wife/companion, as an animal owner, as a coach, as a friend, as a consumer, as an eBay entrepreneur, etc.  We have so many different SLAs that even if we tried we couldn’t identify them all.  If the role you have in an organization is based on providing services to others, then you definitely have SLAs in place even though they may not be formal.

For instance, it is generally expected by the Development Teams that if they have a problem they send it to the "_EDC-Deployment Team" list and someone will take a look at it.  While this is indeed the pattern that we have been asking people to adopt, we have not specified the response time for such a request. 

Let’s assume that the request is five minutes work.  Does this mean that you should expect to receive a reply in five minutes?  What happens if the person who needs to do this has just gone to a meeting?  What if they had to go to the washroom?  OK, maybe you change your expectation.  Maybe you change it to:  receive a response from the Deployment Team in twenty minutes even if they haven’t actually done the work.  What about lunch?  Is someone from the Deployment Team supposed to be constantly checking email during lunch?  What about a staff meeting?  Is it going to be possible to have a staff meeting with the members of the Deployment Team if someone is constantly answering emails and responding to requests?  Do we implement tiers of service whereby Production always gets responded to within minutes (5?  10?) and the other environments on a "when available" basis?  But what if someone is doing some testing in UAT so that they can confirm that a hot fix for Production is working?  If this is Production related does it fall under the Production response time or the other environment service level?

While my wife and I were thrilled that the plumber met his service level, I could envision dozens of legitimate reasons why he wouldn’t be able to do so.  And yet, even though those reasons would be perfectly valid, we both know that I’d be cranky that he missed his appointment, because, after all, he affected me.

Why am I bringing this up?  Well, everyone reading this note can probably tell his or her neighbour about the time the Deployment Team took x minutes to do y seconds worth of work.  While I sympathize with you and promise that we’ll do better, please realize that there may be extenuating circumstances:  we may be short staffed due to illnesses, we may be with another client helping them or something as mundane as no one has had a bite to eat in 10 hours and they’re hungry.


Internal Expectations

I’ve talked in the past about expectation setting with your users and how that is a good thing.  Internal to ITM, however, we have areas that provide support to other areas and there are expectations in this area as well.

Recently I had the privilege of being Yvonne’s backup with regard to database administration work.  Most of it has been in the area of DeCo requests and the expectation of the end users has been, well, stunning, to say the least.  While some people did not seem to have a specific timeline in mind with regard to when the data fix needed to be done, other people, heck even the same people but for a different request, seem to have the expectation that if thirty minutes had passed since the request had been submitted then the request should be completed!!!

Is this reasonable?  Well, whether or not I think it is reasonable, it is what the end users are expecting because that is the level of service that Yvonne has been providing.  This makes it very difficult for anyone trying to even temporarily fill in for Yvonne as the odds are that this person is not trying to do Yvonne’s job, just cover for her, which means that they still have another job that they have to do.  Unfortunately there is no simple solution to this as the expectations have been set and trying to claw them back now just isn’t going to work.  In the long term we are looking at a solution that will allow for more self service in terms of data fixes while still maintaining the high level of auditability.  (Potentially higher, but if you want self serve then you will have to live with it.  🙂 )

Conversely, however, there is an expectation on the part of the DBA – OK, really just me in Yvonne’s stead – that the SQL submitted for running actually, well, runs successfully!!!!  If you submit a piece of SQL to be run as a data fix, and that SQL does not run successfully, do not expect the DBA to fix it for you.  Instead, what I am advocating to Yvonne, and what I will be following while in her place, is that if the SQL fails, immediately stop the deployment and reject the deployment request.  I mean, if we were implementing an application and the installation package failed, we would notify the coordinator and say fix it.  Why should we do something different with SQL than with an application?
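To make the "stop and reject" rule concrete, here is a minimal sketch in Python.  It uses SQLite from the standard library as a stand-in for SQL Server, and the function name run_data_fix and the "APPLIED"/"REJECTED" statuses are made up for this example; they are not part of any actual deployment tooling.

```python
import sqlite3


def run_data_fix(conn, statements):
    """Run the submitted statements in order.

    Hypothetical helper: on the first failure everything done so far is
    rolled back and the request is rejected. Nobody tries to repair the SQL.
    """
    try:
        # Using the connection as a context manager commits if the block
        # succeeds and rolls back if an exception escapes it.
        with conn:
            for sql in statements:
                conn.execute(sql)
    except sqlite3.Error as exc:
        return ("REJECTED", str(exc))
    return ("APPLIED", None)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)")
    conn.execute("INSERT INTO t VALUES (1, 1)")
    conn.commit()

    # The second statement refers to a table that does not exist, so the
    # whole request is rejected and the first update is undone.
    print(run_data_fix(conn, [
        "UPDATE t SET v = 2 WHERE id = 1",
        "UPDATE no_such_table SET v = 1",
    ]))
```

The point of the sketch is only that a failed script leaves the data exactly as it was and goes back to the person who submitted it, the same as a failed installation package would.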

Serious Problem Here

There are three more "ideas" that I was sent that I want to discuss:

If your database needs constant data fixes (projectX, projectY), FIX THE CODE.

This is such a simple concept, but we seem to have missed it somewhere in our education.  If you need to do the same data fix, but with different parameters, over and over again, there is a problem with your code.  Rather than invest the energy in creating, executing and tracking data fixes, fix the code.  At the very least, let the users fix the data themselves.  Create a screen that accepts the required parameters and executes it on behalf of the user, with all of the proper auditing that needs to be in place.
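As a rough sketch of what such a "screen" might call behind the scenes, here is a parameterized fix that writes its own audit record.  Everything in it is hypothetical: the orders and fix_audit tables, the column names and the apply_status_fix function are invented for illustration, and SQLite stands in for SQL Server.

```python
import sqlite3
from datetime import datetime, timezone


def apply_status_fix(conn, user, order_id, new_status):
    """Apply one parameterized data fix and record who ran it and when.

    The update and the audit row share one transaction, so there is
    never a fix without an audit record or the other way around.
    """
    with conn:  # commit on success, roll back on any exception
        cur = conn.execute(
            "UPDATE orders SET status = ? WHERE id = ?",
            (new_status, order_id),
        )
        conn.execute(
            "INSERT INTO fix_audit (run_by, run_at, order_id, new_status, rows_changed) "
            "VALUES (?, ?, ?, ?, ?)",
            (user, datetime.now(timezone.utc).isoformat(), order_id, new_status, cur.rowcount),
        )
    return cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute(
        "CREATE TABLE fix_audit (run_by TEXT, run_at TEXT, order_id INTEGER, "
        "new_status TEXT, rows_changed INTEGER)"
    )
    conn.execute("INSERT INTO orders VALUES (1, 'OPEN')")
    conn.commit()
    print(apply_status_fix(conn, "some_user", 1, "CLOSED"))
```

The same fix, with different parameters, no longer needs a new script, a new request or a DBA.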

If you are corrupting records, it isn’t the database, FIX THE CODE

It’s amazing how many times the database has been identified as the culprit as to why the data is in its current state.  "Oh, something must have made SQL Server corrupt the data" or "Maybe SQL Server can’t handle the load that we are asking it to do and it failed to do something we told it".  Maybe the real answer is that SQL Server did exactly as it was told.  Our code told it what to do and it did it.  SQL Server is not an autonomous machine randomly changing data on a whim.  If the data is corrupt, we, the humans in charge, did it, not the machine.

If you have a 24 hour load process, don’t blame the DBA, the hardware, the OS, maybe FIX THE CODE

We’ve got really large machines in our UAT and Production environment.  You would be amazed at what they are capable of doing, not in a day, not in an hour, but every minute!!!!  There is enough processing power to do physics calculations on the interaction between dozens of objects in an environment ten, twenty or thirty times per second.  There is enough processing power to correctly calculate and render a 3D environment in which these objects exist in real time and show them at thirty frames per second.  What I’m trying to say is that the machines are more than capable of playing the world’s most complex and demanding computer games in existence and providing excellent response.  The graphics and physics engines embedded in these games make thousands of calculations every second.  Tens or even hundreds of thousands of calculations every minute.  And yet, we’re busy doing 100 records a minute into a database and we’re happy.

Does someone other than me, and the person that highlighted these points, see a serious problem here?

Planning to Fail

I’ll be honest, I kind of suck as a Project Manager.  (Sorry, Dawn, but you’ve already hired me!!!!)  My problem lies around the attention to detail that a Project Manager needs.  When coming up with a project plan they need to ensure that all tasks for the project are actually accounted for in the project plan.  Having come from a large consulting company, that part was actually made a lot easier by using a standard template that had all possible tasks, from which you removed the things you wouldn’t be doing in this project.  (Removing things is a lot easier than adding things.)  By adding in some default assumptions about the size and complexity of the application, voilà, out came a project plan.

The next step is assigning people to tasks.  This is where 90% of project managers make the biggest mistake.  If you have 80 hours worth of work and you have a person who works 8 hours a day, how long will it take to do the 80 hours of work?  If your answer was 10 days, congratulations, you are part of the 90%.  If you said something else, you knew it was a trick question and you decided to play games with me, so consider yourself part of the 90% as well.

Contingency.  Nothing is perfect.  No task is perfectly timed.  You plan for the unknown.  You plan for things that you can’t forecast.  You add a certain percentage of the task time as contingency.  This is essentially a slush bucket of time to fix things or redo things that you didn’t know you would have to do when you originally scheduled the task.

Sick Days.  While this 80 hours of work may not intersect with an employee being sick, the longer the project is the more likely you are going to have sick employees.  Indeed, depending upon which survey you agree with, the number of sick days per year for IT workers varies considerably, but probably averages out to about 1 day for every 25 to 30 days worked.  Some people try to cover sick days under contingency while others have a separate category for it.  (If the project is long enough, consider accounting for vacation time for all staff as well as support personnel.)

Miscellaneous Project Time.  Let’s not forget about the fact that you are sending emails to this employee that he has to read (skipping development), or that he needs to attend a project meeting or fill out a project report, all of which is "non-productive" time.  Indeed, non-project time can be a considerable amount of time.

Longer hours = less productivity.  Studies have shown that for short periods of time, one to two weeks maximum, people can work longer hours without decreasing their hourly productivity.  After that, however, their productivity hits a rapid decline.  I worked on a project where 55 hour weeks were mandated.  After a couple of weeks of this they got just as much productivity out of people as they would have if a 40 hour week was enforced.

So, how long will it take for 80 hours of work to be done?  11.75 days +/- 1.4 days

How many days are going to be in the project plan?  10 days.  We’re already off to a bad start and the project hasn’t even begun.  No wonder people dislike project plans or even planning in general.  You’re already in trouble and you haven’t even started.
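For anyone who wants to play with the arithmetic, here is a small Python sketch of how contingency, sick days and miscellaneous project time stretch the naive estimate.  The percentages are my own placeholder assumptions (10% contingency, 5% of each day lost to email and meetings, one sick day per 27.5 days worked, the midpoint of the 25 to 30 range above) and are not the figures behind the 11.75 days, so expect a different answer.

```python
def naive_days(work_hours, hours_per_day=8.0):
    """The answer 90% of us give: hours of work divided by hours in a day."""
    return work_hours / hours_per_day


def estimate_days(
    work_hours,
    hours_per_day=8.0,
    contingency=0.10,                   # assumed slush bucket, 10% of task time
    overhead_fraction=0.05,             # assumed share of each day lost to email, meetings, reports
    sick_days_per_worked_day=1 / 27.5,  # assumed midpoint of "1 day for every 25 to 30 worked"
):
    """Rough elapsed working days once the usual suspects are accounted for."""
    days = naive_days(work_hours, hours_per_day)
    days *= 1 + contingency               # add the contingency
    days /= 1 - overhead_fraction         # only part of each day is productive
    days *= 1 + sick_days_per_worked_day  # spread expected sick time across the task
    return days


if __name__ == "__main__":
    print(f"naive: {naive_days(80):.2f} days")
    print(f"adjusted: {estimate_days(80):.2f} days")
```

Whatever percentages you plug in, the adjusted number comes out larger than the 10 days that ends up in the plan.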

Fixing Code

I’ve touched a nerve and it seems to be a pretty raw nerve with a lot of people.  Fixing code.  Here’s what one person said the developers need to do when fixing their code:

  • take the time to understand the bigger picture and the context of the code’s function
  • take the time to understand the impacts of what they’re doing, even just commenting out a line
  • rewrite their code rather than creating more spaghetti code

So, this person seems to want programmers to understand what they are doing, the impact of what they are doing and to do it correctly (no spaghetti code).  When you go to a mechanic to have him fix a problem with your car, you expect him to know what he is doing, to understand the impact of the change and to fix the problem correctly.  If you go to a doctor for an operation, you expect him to know what he is doing, to understand the impact of the operation and to do the operation correctly.

So why is it that programmers, developers, whatever you want to call them, why is it that they continually make silly changes to the code, make things more complicated and just don’t seem to care to do things correctly?  For instance, in a recent project we observed the following code (things have been changed to protect the guilty, but the same concept is there):

IF anotherField > 3 THEN year = year + 3

IF anotherField <= 3 THEN year = year + 3

These lines were back to back in the code.  The net result is that regardless of the value of anotherField, year is going to be incremented by 3.  There may have been some different logic in there previously, but whoever made the change did not bother to look up and down the code for a few lines, the immediately preceding line in this case, to see what was going on.  As it is the code is longer than necessary, takes longer to run than necessary, is more complicated than necessary, is confusing to the casual reader and just needs to be rewritten.
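Here is the same thing translated into Python, with the two-line version next to what it collapses to.  The function names are mine, and the sketch assumes anotherField always holds an ordinary number (a missing or null value would need its own handling).

```python
def increment_original(another_field, year):
    """The code as we found it: two tests that between them cover every number."""
    if another_field > 3:
        year = year + 3
    if another_field <= 3:
        year = year + 3
    return year


def increment_simplified(year):
    """What the code actually does: add 3, whatever another_field is."""
    return year + 3


if __name__ == "__main__":
    # Exactly one of the two tests fires for any number, so the results always match.
    for another_field in range(-5, 10):
        assert increment_original(another_field, 2008) == increment_simplified(2008)
    print("both versions agree")
```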

Yes, sometimes to fix a problem you only need to change a single line.  But the better approach may be to change half a dozen lines, or more, so that the next person finds the code cleaner, simpler and better.

Efficiency and Simplicity

So, for the second time in less than a week I had the opportunity to use the Capital Health Region’s medical services.  Last week I spent three hours in a medicentre waiting room to see a doctor for three minutes.  I would have gone to another medicentre but this was the only one that was open in a 10km radius that I could find.

On Sunday I witnessed the response to a 911 call and was pleasantly surprised at the rapidity with which the response came.  911 answered the call before I was even sure that the phone had rung and when I was forwarded to the ambulance dispatch, the ambulance was already en route to the house.  How did they do this?  Effective use of technology.

Not just "the use of technology", because, quite honestly, sometimes technology makes things worse instead of better.  Not because the technology is not helpful, but because of the complications in using the technology or the poor implementation of the technology.  The luggage handling fiasco at the Denver Airport is one such example of how technology can be beneficial (the concept was sound and the potential payoff tremendous) but if implemented poorly the result is millions of dollars of effort wasted and, in the long run, a return to manual labour.

When I called 911 they had the reverse lookup available for them so that they knew the phone number I was calling from and the address associated with that phone number.  They had me confirm the information in order to ensure that the records they had were correct and up to date. The rest of the conversation was spent getting details of the incident so that they could be transmitted to the EMTs who were in the ambulance in order to prepare them.  The information that the dispatcher entered into the system was immediately available to the EMTs.

While we try to build our systems to take into account all of the potential problems and all of the potential solutions, we need to keep in mind that there is a very specific task we need to accomplish and that we should focus on getting that task completed in the most efficient manner possible. Not the most elegant manner and not the most all-encompassing manner, but the most efficient.  While you may not be writing an application that is designed to help save lives, that doesn’t mean that the concepts – efficiency and simplicity – are not applicable to what you are doing.

Pulling the Plug

At what point do you "pull the plug" on an old system?

I know, there are lots of reasons why the system needs to be kept around and there is always the concern about where the money to redevelop the system is going to come from, but at some point you need to look at the system and say "It’s been nice, but so long old buddy."  When does this moment come?

You could look at things from a purely objective point of view and work out the cost benefit of a new solution.  You examine the development costs of a new solution and compare it to the support costs of the existing solution.  You can look at the risks involved with both the old and new systems and, if possible, assign a dollar value to the risks.  Are you using technologies that are no longer going to be supported?  At the end of the day you will end up with a chart showing you, in all likelihood, that it is more expensive to build a new system than to keep the old.

You can start going off into areas that are not as easy to quantify.  How much will it cost the organization if a certain piece of functionality is not implemented?  Can this even be quantified?  What about staff morale?  Is the current system lowering staff morale to such a point that you are in danger of losing key staff?  Is it contributing to the turnover rate in your area?

Sometimes, though, even this isn’t enough.

In the movie Blade Runner the replicants, the "artificial humans", were built with a four year life span.  At the end of that time they simply died.  I sometimes think that this is what we need in some of our applications.  Call it a Software Personal Directive.  This would be a deadline after which the software is supposed to be replaced.  Plan for it.  Get ready for it.  Announce it to the world, but most of all, understand that software doesn’t last forever.  As the technology changes, so must the software built on that technology.  As long as we have constantly evolving operating systems, browsers and users, our applications are limited in life.  We need to understand this concept and embrace it.

What is the right life span?  The replicants were given a four year life span.  In my opinion, critical, core applications need to be reviewed, examined and potentially rewritten every four years.  Other systems are stable and probably won’t change in twenty years.  Yes, I’m complicating matters by not setting a single life span, but when have I ever taken the easy way out of a conversation?

Efficiency

When I went to NAIT we had an interesting instructor for one of our classes.  He was supposed to be teaching us how to design interfaces for applications.  Since this was the day of the green screen his examples were all of mainframe based applications.  Looking at those screens brought some unpleasant feelings welling up from within.  It was the opinion of everyone in the class that the instructor "didn’t get it".

Indeed his screens were overly complicated, cramped and just plain confusing, but he was convinced that you needed to design screens this way in order to be efficient.  Why the disconnect?

Some people confuse efficiency with the amount of information that they present to the user in a single screen.  The more "information" the more "efficient" the screen.  This is so wrong.  Studies have shown that by keeping things simple people can comprehend what is going on faster and with a higher retention rate.  In the example that our instructor gave us he was trying to demonstrate what needed to be on a customer contact screen.  He had not only the basic customer demographics, but a recent contact list, dates and amounts of previous orders, recent changes to the account and payment history.  Some of this information was denoted in codes which the person looking at the screen was supposed to have memorized in advance.

Information overload is not efficiency.  For instance, if in the majority of cases the user just wants to look at demographic information, show just the demographic information.  Tailor the information that you present to match the circumstances for the screen.  Don’t be afraid to add a tab or other item for the user to get more detailed information.  If they need it you have provided a means for them to get it.  If they don’t need it you won’t have wasted any time retrieving it.
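In code, "don’t retrieve it until they ask for it" might look something like the Python sketch below.  The CustomerScreen class and the two fetch functions are hypothetical stand-ins for whatever the real screen and data access layer would be; the only real library feature used is functools.cached_property (Python 3.8 or later).

```python
from functools import cached_property


class CustomerScreen:
    """Hypothetical screen model: demographics up front, the rest on demand."""

    def __init__(self, customer_id, fetch_demographics, fetch_order_history):
        self.customer_id = customer_id
        self._fetch_order_history = fetch_order_history
        # The common case: loaded immediately because nearly everyone needs it.
        self.demographics = fetch_demographics(customer_id)

    @cached_property
    def order_history(self):
        # The "extra tab": fetched the first time someone opens it,
        # then remembered, so it costs nothing if nobody ever looks.
        return self._fetch_order_history(self.customer_id)


if __name__ == "__main__":
    calls = []

    def fake_demographics(customer_id):
        calls.append("demographics")
        return {"id": customer_id, "name": "Example Customer"}

    def fake_order_history(customer_id):
        calls.append("orders")
        return [("order-1", 100.0)]

    screen = CustomerScreen(42, fake_demographics, fake_order_history)
    print(calls)           # order history has not been fetched yet
    screen.order_history   # the user clicks the tab
    screen.order_history   # clicking again does not fetch again
    print(calls)
```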

The purpose is to allow someone to do the same job they are doing now more efficiently and sometimes that actually means less information, rather than more.

Support

The other day I was adding a terabyte of storage through a NAS device on my network at home.  There was a question I needed to ask and I tried both the manual and the web site, but neither had the answer I was looking for.  Since I was still within my warranty period, out came the phone and within a few minutes I had the answer I needed.

There are a lot of products that require you to install, configure, or even assemble before you can actually use the product.  Computer equipment is notorious for this, but so is almost anything electronic.  Even the desk that I use at home, assembled by my wife and me, came with instructions on how to assemble it and, if we had problems, a phone number to call.  (I recently found these instructions due to the sump pump/water heater problems we’ve had at home.)

The quality of the assembly instructions varies quite a bit as well.  Less expensive products (i.e. cheap stuff) usually come with instructions along the lines of "unpack, set up, turn on".  Other products go through detailed instructions.  The instructions for the desk were excellent and told exactly what we had to do at each point.  Lego is much the same as it shows you step by step how to build a castle or dragon or spaceship.  (VCRs are the exception in that no one really knows how to set the darn time on them.)

So, I guess from my perspective the quality of the installation instructions and the ability to call support are really important pieces of quality products.

So, why are the instructions that the Deployment Team receives so, well, lacking in terms of details?  When I was Yvonne’s backup I noticed that people would just attach some SQL scripts and submit the request.  No indication of even which database to run the scripts against.  Sure, I could make some assumptions, but to be honest, I really don’t like guessing when it comes to data fixes.  An extra few characters, the name of the database for goodness sake, goes a long way towards answering questions.

For more complex installations having someone that we can call would be handy.  No, necessary.  If you’ve got something complex that you are installing (more complex than a set of IKEA shelves) then make sure that someone is available to help us out if there are any problems.  It may not be necessary.  No, shouldn’t be necessary, but if there is a problem it would be nice to be able to talk to someone about it.  Maybe I’m getting old (yes, that was my 45th birthday earlier this month), maybe I’m getting cranky (based on my emails this month that is a guarantee), or maybe I am just becoming more demanding, but if you’ve got an important or complex migration coming up, you better have decent instructions and someone we can call if there are problems.  If not, you can expect a cranky, personalized email.

Beautifying SQL code

Boy, I must be sick.  The other day, for fun, I was reading a blog post on beautifying TSQL code in SQL Server.  Looking back over my brief stint as Yvonne’s backup, I too shuddered when memories of those days flooded into my mind.

I have to agree with the author when he said that we are spoiled due to the formatting capabilities within Visual Studio.  Back in the old days, when computers were measured in terms of MHz, not GHz, a lot of the tools did not provide automatic formatting capabilities.  Instead, it was left up to the programmer, the human behind the keyboard, to manually alter the formatting in such a way that it was readable.  The company I used to work for actually had some very specific coding standards with regard to COBOL. 

  • The TO keyword in a MOVE statement was supposed to start in column 40
  • Paragraph names started with a 5 digit number and were ordered sequentially within the application
  • Constants within the application (ex. ‘3’ or ‘T’) had to be removed and replaced with a working storage item

Within Visual Studio there are a number of formatting rules built in that make things easier to read for the majority of people.  For TSQL coding, however, we seem to have failed to understand that "neat" code is actually easier to read, easier to understand and easier to modify.  For instance, one of the most beneficial items is the indenting of blocks of logic so that they are set apart from the lines around them.  This helps you to identify all of the lines of code associated with an IF statement.  So why don’t people do this?  Is it too hard to hit the tab key one more time?  Does RSI prevent you from using your left pinky?

Seeing a string of code with no indentation, no capitalization, no effort made to make the code readable is quite distressing.  One of the things I was taught at an early age (OK, 22) was that we are custodians of the code.  We do not own it, we modify it for our clients.  It is their code.  It is their application.  If I house sit for someone I don’t go into their house and make a mess.  If I am looking after their pets, I don’t take the cat litter and sprinkle it on the carpet.  So why would I write messy code and leave it behind for someone else to decipher and clean up?