Something came through email a while back that I feel compelled to share. A quote that gave me pause:
“I suspect, if the servers are left to run without any application or database services or processes, they will run happily without incident.”
Brilliant! About the most blinding flash of the obvious I’ve seen in a year.
The person who wrote the message was working as a service provider: a high-end system provider, running complex environments, and so on. The quote is akin to saying “it’d be a great job if not for the customers”.
To be perfectly honest, when I started working with IT systems too many years ago I sometimes felt like this (sometimes), but the years have taught me that it is a very unhelpful attitude. It’s a message that I think diverts from proper diagnostics (although a perfect world is nice to keep in mind), and it also hints that the individual would rather be doing something else.
In this case we are chasing a significant performance issue which causes system outages, and the customer is expecting both my company and the service provider to keep at it, find the root cause, and get it resolved. That’s my goal – fix it so that it never happens again.
This case is interesting in that the issue did not occur for three months; then some changes were made (see points 1-5 below), and the crashes started. The changes were:
1. A peripheral web service was added to the set of web services to populate some data automatically. This is a minor code change, running on different hardware (standard Windows stuff).
2. A footer on the reports output from the system had a tagline changed. A trivial-impact change.
3. All applications running on Unix boxes were re-installed from scratch on new servers. There are also some Windows servers hosting services which connect to these Unix servers, but they have not changed.
4. The new servers were also placed into a new co-lo hosting area not previously used by our systems, which requires new hosting rules, firewall rules and settings, etc.
5. The networking between the co-lo locations has been changed a few times to facilitate various hardware moves for other customers of the hosting provider.
I’m being told to spend a fair amount of time and energy to investigate points 1 and 2, and told that points 3 – 5 are being taken seriously, but are likely to be “fine”.
Right. Major hardware, hosting, and network changes are unlikely to affect a high-transaction-volume system. However, a change to a report footer or a small web service is the likely cause.
Righto, off to the madhouse for me I guess.