November 10, 2017
TABLE OF CONTENTS
I’m taking a brief excursion from my usual identity and API-centric posts to answer a question about performance tuning that someone asked me earlier this year. In a previous incarnation of my career, I was focused on performance tuning and diagnostics—particularly involving Java systems. However, the same principles can apply to just about any running system. This post explores how to approach load testing and performance tuning for just about anything.
Performance Testing (and Tuning) is generally best done with an approach similar to Black Box Testing, which involves testing a system without any specific knowledge of the internals of the system—or, at least approaching the problem in this fashion. In all likelihood, you are going to have a fairly detailed idea of what the system architecture and internals of your system looks like. This may not be true of SaaS, PaaS, or third-party vendor applications where you don’t have source code access.
This approach works with everything (including vendor and SaaS apps). Note, most SaaS providers (and third-party hosting providers) don’t necessarily want you load testing their systems—unless specific provisions are made (legally, operational support, and system capacity standpoint).
So, one is not examining source code at this point—that comes later. We are interested in the overall behavior of the system (including the code running in it). I was at a client site a few years ago, working directly with an application development team and doing this type of work; the lead developer had trouble understanding why my performance tuning methodology didn’t start with reviewing code.
This approach to performance tuning entails:
Step #8 is the hard part.
Given my background in middleware, I tend to start towards the bottom of the technology stack and start working my way up until I find the problem. Eventually, I get to the application code. Sometimes, that proves more efficient; sometimes, starting at the application code and working your way down is more efficient.
One of my peers approaches this activity in the exact opposite fashion that I do; he will start looking at the code first, and then work his way down into the system until he finds the problem. This peer’s skill set includes being very proficient at using Java debuggers to efficiently find problems. Use the approach that comes most naturally to you. Not everyone has the mindset or patience for this activity; your organization should employ someone who does.
When a bottleneck or error is encountered, start at the place where the issue is observed. Then, move deeper into the system as needed to identity the root cause. Maintaining discipline in this activity tends to pay off in the long run.
I described this methodology as similar to black box testing, but there is an important difference. The testing team is aware of the system internals (to some level), often has access to the source code, but approaches testing as though they don’t have these things.
When problems start surfacing, the team uses the knowledge of the system (system internals, application code, etc.) to troubleshoot the problem. It is likely that the team members applying this knowledge to troubleshoot problems is different from the team members running the load tests.
It’s necessary to understand what is being load tested. In fact, the more familiar you are with the system you are trying to tune, the more successful you will be (and most likely, the easier it will be) in your tuning efforts.
Create a document that captures all of this information before you begin creating load test scripts or running load tests.
The following people (or roles that one or more people may fill) are needed for the load testing phase of a project. Some of these roles may be filled by the same person. Some may not be relevant to all situations. Sometimes, more than one person is needed to bring all the necessary knowledge to the table for one of these roles. Multiple people may be needed, too. Additional resources could be needed:
Before load testing begins, some information must be gathered. This includes:
That last one is the real trick. If this is a brand new system, then you are essentially guessing. If this is an existing system with a point upgrade or minor update, then there is real world data that can be used to figure out relative usage of various transaction to build a load test.
The basic idea is to generate load, identify bottlenecks, eliminate bottlenecks, and repeat until the desired SLA is obtained with the desired load. Getting through even one iteration of this cycle can take anywhere from hours to days (if not longer for new or complex systems).
The steps described in the “Prep Work” section above must be completed before you begin. If you don’t have this information, success will be elusive. Or, you will simply be tracking it down a little ways down the road.
You need to produce load. Choose your load testing tools. Ideally, your organization would already have tooling and hardware in place. Use these tools to record a load test that mimics expected production usage.
Run load tests to produce sufficient amount of traffic to meet the defined SLA. You will incrementally approach this target.
Most of the time when I have done this, the first several iterations of running a small amount of load encounters various issues ranging from identity and access issues for test accounts to network connectivity.
The test engineer must get the test to a point where it can run at 1–2 TPS without any errors occurring. For a brand new system, this can be challenging. Keep at it and you will get there. But, be realistic; I’ve seen this take anywhere from a couple of days to two-six weeks, depending on the complexity of the system.
Once errors related to configuration and application bugs have been eliminated, you can begin increasing load. At some point, you will hit your first load related issue.
Unfortunately, I cannot tell you how to resolve those issues. These could be at any layer/tier of the application. It could be in system configuration, middleware configuration, system resources, application code, or other pieces. As a first step, narrow down where the problem is:
Having a good understanding of the software the application is built on top of is a necessary precondition to doing this successfully. It doesn’t have to be the load test engineer who has these skill sets. Other Subject Matter Experts (SMEs) can be brought in as necessary to monitor and troubleshoot problems.
Once the current issue is resolved, run the load test again. Keep repeating this cycle until the system can run a sustained load test for at least several hours (four to six) at the desired load level.
As issues are encountered, it can be difficult to separate the cause and effect. Which observation is the root cause and which is a side effect? This comes with experience.
Keep a detailed log of configuration changes that are made when troubleshooting this process. It seems obvious, but sometimes people forget to be diligent about this and then have no idea what fixed it. Only change one thing at a time! Even if it seems like a small detail, only apply one change at a time.
Before new systems, or changes to existing systems, go live, performance tuning needs to be done. Performance tuning is a process. It’s a journey—not a destination. You could always spend more time to make the system a little more efficient, but is it needed? This is where defining the required SLA (or target performance level) ahead of time is needed. This will tell you when it is not only good, but that it is good enough.
Robert C. Broeckelmann Jr.
At the end of lunch with a mentee, I used the items on our table to express the fundamental concepts of Kubernetes. Sometime after explaining the purpose of the Kubernetes scheduler, she asked a question I spent the next several weeks thinking about.
API design is crucial, giving structure to application interaction. Given cross-functional teams and applications, development time is reduced with a clear, intuitive way to access data. API development often follows two approaches: REST and GraphQL.
As of June 2018, the state of California passed a new privacy law that could lead to more consequences for US-based companies than the European Union’s General Data Protection Regulation (GDPR). Here's what you need to know and how to be compliant.
Before your data scientists wring value out of your reams of data, it has to be accessible and, on some basic level, coherently arranged. To harness all that brainpower, you need to keep the data wrangling to a minimum. Enter the data lake.