Chasing symptoms, not cause

7 November, 2007 – 10:39 am

Any job in which you need to ‘fix’ something requires you to correctly analyze the root cause. Treat just the symptoms and you’ll find yourself on the never-get-fixed roundabout …

Performance tuning fits this profile neatly. In an array of available metrics, how do you avoid chasing the symptoms and never identifying root cause?

Sit on the fence. It’s better to stay impartial to the problem than to take sides. Performance problems, especially in production environments, often carry high stakes on successful resolution. There will be parties who have a strong interest in resolving the problem quickly (normally the service provider). Other parties will be less pragmatic. Some parties will be offended that a problem actually exists. Some will be in denial. In fact you’re reasonably guaranteed that the full suite of human reactions will be attached to the performance problem you are trying to analyze. Performance problems alone carry some shock value, and I guess human conditioning is quite varied when ‘reacting’ to such a situation. So sit on the fence and try not to favour vendor A over vendor B. It’s not your job to take sides in any case, just to reveal where the problem actually exists (and hopefully what to do about it).

Don’t cover old ground. To avoid doing this you need to stay organised. Come up with a naming convention for your test sets and data. Keep things organised so you can quickly review historical data. Collect more than you think you need, as you may need to come back to that data, and it’s better to have measured all major performance metrics rather than just a few centered around your hunch at that point in time. I think this is a key contributor to forever chasing symptoms, when you’re not quite sure what you tested 2 weeks ago, and you need to go and re-test again.
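As a rough illustration of a naming convention, here is a minimal Python sketch. The function name `run_label`, the field order and the example system and scenario names are all my own assumptions, not a standard; the only point is that every data set gets a sortable, self-describing name.

```python
from datetime import datetime


def run_label(system: str, scenario: str, run: int, started: datetime) -> str:
    """Build a sortable, self-describing name for one test run's data."""
    # Timestamp goes first so that directory listings sort chronologically.
    stamp = started.strftime("%Y%m%d-%H%M")
    # Zero-padded run number keeps run02 ahead of run10 in a listing.
    return f"{stamp}_{system}_{scenario}_run{run:02d}"


# Hypothetical system and scenario names, purely for illustration.
print(run_label("orders-db", "peak-load", 3, datetime(2007, 11, 7, 10, 39)))
# prints 20071107-1039_orders-db_peak-load_run03
```

Use the same label for the results directory, the raw counter logs and the spreadsheet, and finding what you tested two weeks ago becomes a directory listing rather than a re-test.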

Keep a log. Write down what you’ve tested. Keep a log of dates and times, what you were looking for, and what you actually found. Keep a running list of hypotheses (you will form many) and as you rule them out, write down your reasons for doing so. Don’t forget the assumptions you made along the way. Often when you finish solving the performance problem, you’ll look back with the benefit of hindsight and wonder why you didn’t spot it earlier. Is there anything you could have done differently to speed up your own analysis and come to an earlier conclusion?
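A text file or spreadsheet is perfectly adequate for this. If you prefer something scriptable, the sketch below shows one possible shape for such a log in Python. The class and field names (`InvestigationLog`, `Hypothesis`, `looking_for` and so on) are hypothetical, chosen only to mirror the items listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Hypothesis:
    statement: str
    status: str = "open"   # "open" or "ruled out"
    reason: str = ""       # why it was ruled out


@dataclass
class LogEntry:
    when: datetime
    looking_for: str
    found: str


@dataclass
class InvestigationLog:
    entries: list = field(default_factory=list)
    hypotheses: list = field(default_factory=list)

    def record(self, when, looking_for, found):
        """Note what was tested, when, and what actually turned up."""
        self.entries.append(LogEntry(when, looking_for, found))

    def propose(self, statement):
        """Add a hypothesis to the running list and return it."""
        hypothesis = Hypothesis(statement)
        self.hypotheses.append(hypothesis)
        return hypothesis

    def rule_out(self, hypothesis, reason):
        """Close a hypothesis, keeping the reason for later review."""
        hypothesis.status = "ruled out"
        hypothesis.reason = reason

    def open_hypotheses(self):
        return [h.statement for h in self.hypotheses if h.status == "open"]


log = InvestigationLog()
disk = log.propose("Disk subsystem is saturated")
heap = log.propose("JVM heap is undersized")
log.record(datetime(2007, 11, 7, 10, 39), "high disk queue length", "queue length stayed near zero")
log.rule_out(disk, "no queueing observed under peak load")
print(log.open_hypotheses())
```

The recorded reasons are what make the hindsight review possible once the problem is solved.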

Replicate it. Can you replicate this problem in another environment? Can you break this problem down into smaller parts? Can you virtualize the problem? I’ve often found recreating a dummy environment on a virtual machine can be great. Often much of the software you are testing will have limited trial editions. It can greatly assist your understanding if you replicate the problem on smaller (and more controllable) environments rather than tackling that huge production beast head on. This is particularly useful when access is typically limited on a production environment.

Follow a hunch (but be systematic). There is no problem following your own intuition or gut feeling, nor is there any problem following someone else’s hunch. The majority of the time key staff will already have an idea of where the problem might exist, based on their own experience with the system under test (SUT) and, more often than not, a better understanding of the problem history.

Create a hit list. Experience alone will teach you what area you should probably concentrate on. My own generic hit list in priority order is disk, network, CPU then memory. But that order can change depending on what the SUT is and what middleware you are testing. Exclusively Java? Maybe you should elevate memory usage in your hit list. MQ persistence? Keep disk at the front of your line up. DB performance? Time to pick on CPU and disk. Replication? Keep network front and ready. This doesn’t fit all situations, but it helps to have a little method to your madness.
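To make the idea concrete, here is a small Python sketch of a hit list that reorders itself by workload type. The workload keys and the rule of simply moving the promoted areas to the front are my own simplification of the examples above, not a fixed method.

```python
# Generic priority order from the post.
DEFAULT_HIT_LIST = ["disk", "network", "cpu", "memory"]

# Areas to move to the front for a given kind of system under test.
# The keys are illustrative labels, not a standard taxonomy.
PROMOTE = {
    "java": ["memory"],
    "mq_persistence": ["disk"],
    "database": ["cpu", "disk"],
    "replication": ["network"],
}


def hit_list(workload=None):
    """Return the areas to investigate, most likely suspect first."""
    promoted = PROMOTE.get(workload, [])
    # Everything not promoted keeps its generic relative order.
    rest = [area for area in DEFAULT_HIT_LIST if area not in promoted]
    return promoted + rest


print(hit_list())            # ['disk', 'network', 'cpu', 'memory']
print(hit_list("java"))      # ['memory', 'disk', 'network', 'cpu']
print(hit_list("database"))  # ['cpu', 'disk', 'network', 'memory']
```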

Understand the history. Sometimes when the problem is reported to you, or you are brought onto the project in a contract role, you will be handed the poison chalice …

“This system has really crap response times because the disk subsystem is hopeless. The sysadmin has already checked out memory and CPU utilization, and he thinks it’s all disk related …”

That doesn’t give you history. You need to review defect records, talk to the people impacted by the performance problems, speak to the people who have worked on the problem, get historical data and approach each problem with a fresh mind. Try not to rely on historical data; just understand what the history provides and what may have been missed already.

Beware the black box. Often there will be parts of your SUT that are just outside the reach of your monitoring efforts, so it’s easy to treat them as a black box and look at global symptoms such as overall response times for transactions and the like. Once you have worked your way through your hit list of metrics and still can’t find an obvious sign of a bottleneck, it’s time to take that black box apart, or request help from subject matter experts who can. Common black boxes I encounter are the WAN, firewalls, proxies and SANs. More often than not these external components are the contributing factor to your performance problems. So it’s best to systematically work your way through the components that you can actively monitor, rule out any chance of obvious bottlenecks, then systematically analyze the inputs and outputs of the black boxes that form part of your SUT. That forms the impetus for further investigation of those entities. At about this time, slip on your asbestos suit and have your historical data ready for scrutiny … ;)

Correlate. You need to correlate the information you receive from one performance metric with the other metrics available. Never look at one performance metric alone. Can you explain the relationship between one metric and another? Relationships matter. Find the odd one out and chances are you’ve found your bottleneck. Jump at every observed increase or spike in isolated metrics and you will be forever chasing your tail.
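One way to put numbers on "the odd one out" is to correlate every metric against every other and see which one tracks the rest least. The Python sketch below does that with a plain Pearson correlation. The metric names and sample values are invented for illustration, and treating the lowest average correlation as the odd one out is my assumption about how to score it; a low score is only a pointer for where to look next, not a diagnosis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no defined correlation; treat as none
    return cov / sqrt(var_x * var_y)


def odd_one_out(metrics):
    """Return the metric whose series correlates least with all the others."""
    scores = {}
    for name, series in metrics.items():
        others = [
            abs(pearson(series, other))
            for other_name, other in metrics.items()
            if other_name != name
        ]
        scores[name] = sum(others) / len(others)
    return min(scores, key=scores.get)


# Made-up samples taken at the same five points in a load test.
samples = {
    "load_users": [10, 20, 30, 40, 50],
    "cpu_pct": [12, 22, 33, 41, 52],
    "net_mbps": [5, 11, 14, 21, 25],
    "disk_wait_ms": [3, 9, 2, 8, 3],
}
print(odd_one_out(samples))  # disk_wait_ms
```

Here CPU and network rise with load while disk wait does not follow the pattern, so it is the relationship you would try to explain first.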

Know your metrics. Know what performance counters are available. Know your operating system. I’ve found that previous Windows environments have polluted my understanding of some performance counters: assumptions I made there just didn’t work out for me in a Solaris environment. Buy a book, blog it, google it, wiki it, understand it. After all, this is your bread and butter.

Know your thresholds. If you don’t know your thresholds, you’re not easily going to be able to identify performance metrics that appear to be the odd one out. I’m not just talking about obvious thresholds either, like sustained CPU utilization greater than 80%. You’re not required to be a walking wiki of performance metrics and thresholds though. If you get the chance you should try and baseline the idle system (or correctly configured system under test) to give yourself a yardstick to measure by. Is the performance problem limited to the production environment only, or does it affect all other environments as well? Don’t be put off by undersized alternate environments either. Aside from obvious differences, sometimes pre-production environments might offer striking similarities e.g. CPU set may be exactly the same, it’s just RAM and disk that differs in production. Since most performance metrics have some form of relationship with other components, you can often pre-empt what acceptable performance is for a given metric. For example, environment A and environment B have the same CPU set, but their disk subsystems vary, A is on a NAS and B is on a SAN. You think the problem is disk related, but there are a whole bunch of performance counters related to the CPU (syscalls, context switches etc) that can give you a good idea of file system performance, at least from the perspective of the CPU set.
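If you do get the chance to baseline, the yardstick can be as simple as the mean and spread of each counter on the idle (or correctly configured) system. The Python sketch below flags observed values that fall well outside that baseline. The counter names, the sample values and the three-standard-deviation cut-off are assumptions for illustration only; pick limits that make sense for your own system.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarise each baseline counter as (mean, standard deviation)."""
    return {name: (mean(values), stdev(values)) for name, values in samples.items()}


def outside_baseline(baseline, observed, k=3.0):
    """Return the observed counters more than k standard deviations from baseline."""
    flagged = {}
    for name, value in observed.items():
        if name not in baseline:
            continue  # no yardstick for this counter, so nothing to compare
        avg, sd = baseline[name]
        if abs(value - avg) > k * sd:
            flagged[name] = value
    return flagged


# Made-up idle-system samples and one reading taken under test.
idle = {
    "cpu_pct": [5, 6, 4, 5, 5],
    "ctx_switches": [900, 1000, 1100, 950, 1050],
}
under_test = {"cpu_pct": 6, "ctx_switches": 4000}

print(outside_baseline(build_baseline(idle), under_test))
# prints {'ctx_switches': 4000}
```

The same comparison works between environments: a baseline from a pre-production box with the same CPU set gives you a starting point for the CPU-side counters in production.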

Sort out your tools. Being able to understand symptoms and recognize patterns of behaviour requires access to a good set of tools. No, you don’t need LoadRunner Analysis to be a good performance tester. Something simple like Excel will do. The important part is to understand how your tools work, and how to use them efficiently. I often fall back on Excel to analyze data because I’m reasonably guaranteed that I can share that analysis easily with other subject matter experts, it affords me flexibility in terms of charting, it has a reasonable arsenal of statistical analysis methods and it’s reasonably easy to import data from a variety of sources. I’ve long been meaning to get into a dedicated statistics package like R, but it remains one of those in-between-contract learning opportunities I haven’t had the chance to catch up on …

Later I might update this post with some actual examples. But for now, please accept my soapbox.

2 Responses to “Chasing symptoms, not cause”

  1. Great blog, lot of great insights which will prove helpful during problem analysis.

    By Alfie on Nov 7, 2007

  2. I learnt the lesson about fence sitting… “Boss - we found the root cause to the performance issues - oh look Tim found some problems”. I heartily agree with this post, not just for performance testing, but for ALL IT support people.

    By Ted on Nov 7, 2007
