posted by avanabs on March 19, 2010 06:34 AM
How to lose a half million dollars on one bug!
I want to share with you a recent experience from one of my clients. They have been using Apache Tomcat for several years, often combined with WebSphere (they are a “Big Blue” shop end to end) for supporting back office stateful applications. Early on, they decided to support Tomcat themselves, partly because they did not find any viable vendors, but mostly because the development team (who had been using Tomcat on their desktops) convinced management that this would be “virtually free”. “No problem!”
This has worked very well, particularly while the original Tomcat aficionado continued to provide "support" and...unknown to management...enhancements. There have been only occasional issues, easily handled by their in-house application programmers, with occasional help from the Tomcat community. No one even bothered to keep track of time spent maintaining Tomcat, because it was “free”. Over that time, however, their Tomcat version has diverged from the Apache project, partly because maintaining compatibility wasn't an objective, partly because the cost (mostly developer time) to submit their fixes to the Apache community wasn’t in anyone’s budget, and partly because re-integrating ongoing Apache changes was also unfunded drudgery. So, this organization is “using Tomcat” as far as management knows, but is actually using a diverging branch. All that said, the process continued to work fine and the visible costs were indeed low.
About a month ago, a new application was developed and put into test. This application was fairly simple, but it was projected to generate $100,000/week (TINY by their standards) initially, ramping up to over a half million/week by Q3. It was also the first visible peek at a new business strategy. The problem was that the application failed erratically during test. Subsequent debugging indicated this was due to a bug in Tomcat, not the application, so one of their application gurus quickly rolled out a Tomcat bug fix (enhancement???) and delivered the result back to test.
At first, everything seemed fine. The new application worked great and passed through QA with flying colors. “Free” self support won again…or did it? Application developers working on other projects fairly rapidly found that the new version was breaking some of their legacy applications, including several commercial apps. So, the application developer went back to the drawing board and quickly generated another fix. The new application failed, as did almost everything they tried to run. So, the process continued, with more than two dozen fixes generated and tried, and generated and tried, and…
To make a long and painful story shorter (it STILL wasn’t working), self support had now caused:
- A month delay in launching the new business opportunity.
- The loss of almost a half million dollars in revenue/cost savings.
- A severe delay in their new business initiative.
- Another large amount in lost opportunity cost (they had three engineers working full time on the “quick fix”, not to mention test time).
- A large amount of stress/distrust between the business unit, development, and IT operations, including a major backlash against Tomcat and re-opening the “you should’a used WebSphere” discussions. Red faces all around.
So much for “FREE”!
Lessons Learned:
- Self-supporting Tomcat is indeed a viable option, BUT there are real costs involved and these need to be well understood.
- Server software simply isn’t the same as application software. Some (few) developers can do either, but generally they will also be your gurus, and while they are fixing Tomcat bugs (or troubleshooting non-bugs reported by their application developer colleagues), they aren’t building value for your business. No matter how good, they can’t do both at once, and converging/integrating/testing code threads requires both great care and diligence…not to mention all those hours. BTW, the reverse is also true…one of the worst application developers I’ve ever worked with developed some of the best parts of Linux. It’s simply a different mindset.
- Quick fixes into finely tuned and highly evolved server code can be even more risky than quick fixes into sophisticated applications. Even well documented server code doesn’t always convey the architectural thinking behind specific design decisions, and while specific choices may be elegant from the performance point of view, the effect on the rest of the code can be subtle.
- When someone finally had the crazy thought of trying the OOTB Tomcat 6.0.20, the application simply worked. No one knows why, because careful inspection of the area around the client’s proposed bug fix shows absolutely no Tomcat code change. Unfortunately, because of prior code divergence, the client can’t simply jump to “standard” Tomcat, so they are still figuring this all out.
- Management now understands that they are not really using Tomcat. They are using a web application server (at least one branch, and indications are that there may be many) that originally came from Tomcat but has diverged from the main line.
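One cheap way to make that kind of divergence visible is to compare a local install against a pristine download of the same Tomcat release. The following is only a minimal sketch, not something this client used: the function names and the directory paths in the usage note are hypothetical, and it simply compares SHA-256 digests of every file under two directory trees.

```python
import hashlib
from pathlib import Path


def digest_tree(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_installs(stock_dir, local_dir):
    """Report files changed, added, or removed relative to the stock tree."""
    stock = digest_tree(stock_dir)
    local = digest_tree(local_dir)
    return {
        # Present in both trees, but with different contents.
        "changed": sorted(f for f in stock.keys() & local.keys() if stock[f] != local[f]),
        # Present only in the local tree.
        "added": sorted(local.keys() - stock.keys()),
        # Present only in the stock tree.
        "removed": sorted(stock.keys() - local.keys()),
    }
```

Calling something like `compare_installs("/opt/stock-tomcat/lib", "/opt/our-tomcat/lib")` (both paths are placeholders) returns three sorted lists of relative file names. A mismatch only says that a file is not byte-identical to the stock one; a jar rebuilt locally from unmodified source will usually differ too, so treat the report as a prompt for questions rather than proof of a code change.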
As a longtime science fiction fan, I count Heinlein’s The Moon Is a Harsh Mistress among my early favorite stories. Yes, I know that Heinlein is not “serious science fiction”, but that’s not the point. In that book I was introduced to TANSTAAFL, and it’s stuck with me for a (won’t say how) long time. In infrastructure software it does indeed apply.
“There Ain’t No Such Thing As A Free Lunch”
It ALWAYS costs…somehow, some way, sometime, and it is really important to make informed decisions so that there aren't unpleasant surprises.
Andy has recently decided to make the jump from individual consulting to join the SpringSource team. He will continue to be working with major clients to assist them with IT architecture evolution, now as a member of a large and growing organization. His first project will be leveraging Tomcat, Spring, and a Tomcat-based data grid/cache called GemFire. He’s looking forward to sharing the lessons learned with the tomcatexpert community.
Andy has been architecting, designing, and building enterprise infrastructure and applications software for most of his career. He’s been responsible for BEA’s “Blended Source” initiative, combining the best of Open Source (including both Tomcat and Spring) with WebLogic, BEA’s WebLogic Enterprise Security product family, MEI Software’s financial systems, Netegrity’s SiteMinder security product, Camex’s electronic publishing systems, mainframe applications for Bell Telephone, and many others. During that time his hands-on technology experience has ranged from octal coding into neon lighted switches all the way through Java and beyond, including many generations of “the best and final thing we will ever need”, and he looks forward to working on the even better things coming in the future. He was involved in the early days of Open Source software as a contributor to EMACS and refocused on Open Source during his tenure as Director of Product Management with BEA Systems, combined with a fascination for the rapidly evolving application deployment architectures and technologies driving today’s development. Andy has provided architecture and technology guidance for both vendors and IT organizations and he shares what he is learning through consulting services and through his web site, Enterprise Software Trends (www.estrends.com).
Comments
Isolate deployments
Why did deploying the patch, which allowed the new application to pass "thru QA with flying colors", cause regressions in legacy applications? Did someone go happy-go-lucky and decide to push the patch out to all Tomcat instances? If it ain't broke don't fix it -- those instances running the legacy applications should not have received the patch without reason. With proper isolation, the patch might have only affected the intended target with no more fuss.
Sorry for misunderstanding
They didn't put the patch in production, but since their policy is "everything runs on one standard Tomcat", they do push patches into development and test as a way of ensuring that they haven't broken anything.
In any large scale enterprise, maintaining a common infrastructure is really important, else you'd have to consider the infrastructure as an integral part of every application, with HUGE maintenance costs.
So, I'd give them good marks there. The problem was that they had diverged and no one with the responsibility had realized it. As a followup, the responsible manager is no longer with the company, in spite of the fact that he had nothing to do with the decisions and was unaware of the mess.
Server mods smell
This is actually a prime example of how modifications to the underlying app server should be considered bad architectural practice.
Great point
One of the major advantages of an Open Source application infrastructure is that you CAN maintain/enhance it, but that absolutely does not mean you should, particularly not in an "under the covers", uncontrolled way, as was done here.
Bug fixes and enhancements, submitted back into the community for review and potential incorporation into the code base, are a good thing. Creating (lots of) local branches should be treated with much fear and trepidation.
So what should they have done?
So what should they have done? I agree it smells bad to self-patch Tomcat. But if they have a hard schedule and they need to roll out, what do they do?
I think I know the answer, I'm just curious what you would say.
So what should they have done?
One of the major advantages of Tomcat is the ability to react to problems and fix them in a timely way...typically FAR faster than closed source commercial products can react. Generating a patch, as this client did, is certainly a viable thing to do, and I'd also give them credit for regression testing the patch to avoid generating branches in the code line. That said, this needs to be done carefully and with everyone understanding the implications.
In this case, the client had already branched away from the Apache distro...unknown to their management. That's generally bad, and should only be done after understanding the ongoing implications.
When fixes are required, and it does happen, it's best practice to involve the community. This can be done during the fix if time allows, and after the fact otherwise, but the best result is if the patch becomes a part of the code line or the problem is resolved in some other way.
In this case, the problem does not seem to have been in the area patched in Tomcat, and the patch wouldn't have been necessary in the first place had the client been tracking the Apache distros.
So, patching is fine if you follow a few simple rules and if everyone involved is aware of the cost/benefit situation. As with any piece of software, patches should be folded back into the main line as soon as possible to ensure that you are building on an increasingly stable code base.
Infrastructure rots too easily
I've seen similar scenarios play out many times for shared infrastructure software. Rarely does it get the level of funding and scrutiny that application code does, because of course, it's not often tied to any line-of-business initiative. Managers are only too happy to have the hidden cost of infrastructure maintenance remain hidden.
Whether it's home-built or based on open source or commercial software really makes no difference... it needs disciplined stewardship by managers with strong engineering backgrounds. I know a team with a large set of in-house administrative tools for WebSphere Application Server that quietly, over several years, became a maintenance liability because no one paid sufficiently close attention to what the developers were doing.
I've also managed an open source toolset for a large enterprise, and our policy has by default been "no local patches" because of the spike in maintenance costs it creates for the affected package in terms of builds, testing, upgrades, and coordination with the community. Wherever possible we try to offer the application team a workaround in their own code while we work the problem with the committers.
Jonathan Ross