November 10, 2015
I am pleased to announce the first release of the Message Simulator, a lightweight automation tool for helping harden your cluster. This first version targets RabbitMQ clusters. The Message Simulator’s source code is hosted on GitHub: https://github.com/GetLevvel/message-simulator. Now we can start using the shiny new cluster we built in the previous post, Testing RabbitMQ Clustering Using Docker – Part 1.
When things break in production, it helps when you have seen it before and already know how to fix it!
Gaining confidence that your highly available cluster will handle load, perform, and scale is a complex task that requires expertise beyond the core clustering technology. Understanding how your client applications will use your cluster, and then deciding how to balance the tradeoffs between performance, resilience, and reliability, can require iterations of testing, deployments, debugging, and downtime. We get asked questions aimed at reducing the effort it takes to battle test a cluster and at finding that balance faster. Usually these questions are about:
This post is not long enough to address all of these, but these questions are why we are open sourcing the Message Simulator. If you want confidence that your cluster will handle your specific messaging requirements, it needs to be tested like any other component in your production stack. Testing your cluster against your exact messaging needs lets you determine where it stands with regard to bottlenecks, velocities, thresholds, optimizations, and support overhead, and lets you build out your “how do we restore services” runbook before someone gets paged in the middle of the night.
Hosting your own Highly Available RabbitMQ cluster is not complex, but knowing that your cluster can handle your exact needs is not something the documentation alone is going to teach you.
We initially built the Message Simulator to evaluate the performance hit of having a 3-node RabbitMQ cluster configured to automatically synchronize after a crash.
Figure 1 – Simulating a Cluster Node Crash
Similar to the Netflix Simian Army, we wanted a way to crash clustered RabbitMQ brokers in creative and extensible ways. This led us to build a way to reliably test this process, and with today’s release the entire simulation of external events, broker entities, and messages is outlined in one JSON file. Each JSON file is a self-contained Messaging Simulation Model for regression testing your cluster. Once we centralized everything into a file, we could group the files based on the messaging use case they simulate, which led us to organize the simulations by type and purpose. With today’s first release there are:
The simulator works with any new or existing RabbitMQ cluster and can run on any system that supports Python. The simulator could run outside of your data center or on a VM beside the cluster. So long as there’s a connection, you can start running messaging simulations. Each simulation uses a different messaging pipeline and route map, which means you can run multiple simulations at the same time. If you want to run multiple Load simulations alongside a High Availability simulation, all while your producing and consuming client applications keep using the cluster, well, you can do that too.
We wanted a tool to beat up a cluster and see what happens to our client applications, and so we made it easy to model your messaging traffic while doing terrible things to it in real time.
For those interested in utilizing a RabbitMQ cluster:
What happens when you simulate a network outage by blocking the internal communication port a cluster node uses to talk to the other nodes? When does the cluster realize it lost a node? How do you restore it? What kind of reporting tools can detect this?
Want to see if your cluster is ready? You can find out with this simulation. (If things go horribly awry you can take a look at our Troubleshooting section for putting the pieces back together.)
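As a rough sketch of what "blocking the cluster's internal communication port" can look like on a Linux node, the snippet below only builds the `iptables` commands; it does not run them and it is not how the Message Simulator itself does it. It assumes the default RabbitMQ inter-node port of 25672 and that `iptables` is the firewall in use; the function name `partition_commands` is made up for this example.

```python
from typing import List

# Assumption: 25672 is the default RabbitMQ inter-node communication port.
# Change it if your cluster is configured differently.
CLUSTER_PORT = 25672


def partition_commands(port: int = CLUSTER_PORT, block: bool = True) -> List[List[str]]:
    """Build iptables commands that drop (block=True) or stop dropping
    (block=False) TCP traffic addressed to the cluster port.

    Nothing is executed here; the caller decides how to run the commands
    (they need root on the target node).
    """
    action = "-A" if block else "-D"  # append the rule, or delete it again
    return [
        ["iptables", action, "INPUT", "-p", "tcp", "--dport", str(port), "-j", "DROP"],
        ["iptables", action, "OUTPUT", "-p", "tcp", "--dport", str(port), "-j", "DROP"],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before anyone runs them.
    for command in partition_commands():
        print(" ".join(command))
```

Calling `partition_commands(block=False)` produces the matching delete rules, which is the "how do you restore it" half of the exercise. Other ports the nodes may rely on are left untouched by this sketch.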
You can run a simulation with:
$ ./run_message_simulation.py -f Path_to_Simulation_File
Read more about how to run simulations here: How to Run Simulations
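Since each simulation uses its own messaging pipeline, several can be started side by side. Here is a minimal sketch of a wrapper that launches the command above once per simulation file and waits for all of them. The wrapper and its function names are assumptions for illustration, not part of the repository; only the script name and the `-f` flag come from the command shown above.

```python
import subprocess
import sys
from typing import List

# Script name and -f flag are taken from the command shown above.
SIMULATOR = "./run_message_simulation.py"


def build_command(simulation_file: str) -> List[str]:
    """Return the argument list for one simulation run."""
    return [SIMULATOR, "-f", simulation_file]


def run_all(simulation_files: List[str]) -> List[int]:
    """Start every simulation at once, then wait for each one.

    Returns the exit codes in the same order as the input files.
    """
    processes = [subprocess.Popen(build_command(path)) for path in simulation_files]
    return [process.wait() for process in processes]


if __name__ == "__main__":
    # Usage: python run_many.py first_simulation.json second_simulation.json
    print(run_all(sys.argv[1:]))
```

How to interpret the exit codes depends on the simulator itself, so treat them as a hint rather than a pass/fail verdict.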
Not all clusters need to support the same type of message traffic (for example, a low-latency, highly responsive application versus a safety system that cannot tolerate message loss). To keep things generic, we built the simulator to take in a Message Simulation modeled in a JSON file. Each JSON Simulation Model must implement the following sections:
{
    "Simulation" : {
        "Name" : "Your_Name_For_This_Simulation",
        "Type" : "Rabbit",
        "Rabbit" : {
        }
    },
    "Consumers" : { },
    "BrokerEntities" : {
        "Exchanges" : [ ],
        "Queues" : [ ],
        "Bindings" : [ ],
        "Messages" : [ ]
    }
}
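Before handing a new model to the simulator, it can be handy to check that none of these sections were left out. The snippet below is a small sketch of such a check, written for this post and not taken from the simulator's code. The key names are copied from the skeleton above, and `missing_sections` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List

# Section names copied from the skeleton above.
TOP_LEVEL = ["Simulation", "Consumers", "BrokerEntities"]
NESTED = {
    "Simulation": ["Name", "Type", "Rabbit"],
    "BrokerEntities": ["Exchanges", "Queues", "Bindings", "Messages"],
}


def missing_sections(model: Dict[str, Any]) -> List[str]:
    """List required sections absent from a simulation model.

    Top-level gaps come first, then nested gaps as 'Section.Key'.
    """
    missing = [key for key in TOP_LEVEL if key not in model]
    for section, keys in NESTED.items():
        body = model.get(section)
        if isinstance(body, dict):
            missing += [f"{section}.{key}" for key in keys if key not in body]
    return missing


SKELETON = json.loads("""
{
    "Simulation": {"Name": "Your_Name_For_This_Simulation", "Type": "Rabbit", "Rabbit": {}},
    "Consumers": {},
    "BrokerEntities": {"Exchanges": [], "Queues": [], "Bindings": [], "Messages": []}
}
""")

if __name__ == "__main__":
    print(missing_sections(SKELETON))
```

This only checks that the sections are present. What each section must contain is covered in the repository's documentation.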
Read more about Building Your Own Simulation
For Message Simulation specifics please refer to these sections:
To turn simulations into regression tests that we could always run again, we adopted the convention that everything is processed in sequence, the same way on every run. To do this, we made the simulator support sending AMQP messages from the same list as Custom Event messages. Event messages allow the cluster to be modified outside of normal AMQP messaging operations (producing, routing, consuming). Consider High Availability Test 2, which performs these simulation steps in order:
This test starts by introducing a simple message load, crashes a node, checks whether the cluster can still route messages, restores the crashed node, and then checks whether messaging still works when the cluster’s third node comes back online. The goal is to make the simulation JSON flexible and generic so we can focus on writing JSON tests instead of modifying the underlying code to test a cluster. The Message Simulator currently supports these Custom Events and Message Types:
Inevitably a simulation will end up breaking your cluster in some unexpected ways. Finding out how to restore services before it is mission critical is always a good exercise, and that’s why we added a simple troubleshooting guide for restoring your RabbitMQ cluster to normal operation. While it is not comprehensive for all cases, it is focused on getting your cluster back up quickly so you can test whether the new configuration changes are more stable than the previous version.
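The supported event and message type names live in the repository. As a rough illustration of the in-order convention described above, the sketch below walks one list that mixes AMQP messages and custom events and dispatches each entry by type. The type names (`AMQP`, `StopNode`, `StartNode`), the field names, and the handlers are all hypothetical, and the handlers only record what they would do instead of touching a broker.

```python
from typing import Any, Callable, Dict, List

Step = Dict[str, Any]


# Hypothetical handlers: each returns a description of the action it stands for.
def publish_amqp(step: Step) -> str:
    return f"publish {step['Body']}"


def stop_node(step: Step) -> str:
    return f"stop {step['Node']}"


def start_node(step: Step) -> str:
    return f"start {step['Node']}"


# Hypothetical type names, not the simulator's real ones.
HANDLERS: Dict[str, Callable[[Step], str]] = {
    "AMQP": publish_amqp,
    "StopNode": stop_node,
    "StartNode": start_node,
}


def run_steps(steps: List[Step]) -> List[str]:
    """Process every step strictly in list order, so each run repeats the same sequence.

    An unknown type raises KeyError instead of being skipped silently.
    """
    return [HANDLERS[step["Type"]](step) for step in steps]


if __name__ == "__main__":
    steps = [
        {"Type": "AMQP", "Body": "load message"},
        {"Type": "StopNode", "Node": "node-3"},
        {"Type": "AMQP", "Body": "routing check"},
        {"Type": "StartNode", "Node": "node-3"},
    ]
    for line in run_steps(steps):
        print(line)
```

Keeping messages and events in a single ordered list is what makes a run repeatable: the crash always lands between the same two messages.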
Read more about Troubleshooting
The goal of testing High Availability is to validate that your cluster can meet your client applications’ messaging demand (hopefully with a long string of 9s). The Simulator already includes Stress and Load tests, and now we are interested in building out more extensive High Availability simulations.
We have started a list of High Availability Simulations Coming Soon. Let us know if you would like to see a specific High Availability simulation.
For now the focus will be on:
As time goes on, we will keep the most up-to-date list here: https://github.com/GetLevvel/message-simulator#ha-tests-coming-soon
Well, that’s it for this post. We are excited to hear your feedback on this automation tool, and we hope you find it valuable. Let us know if you would like specific simulations added to the GitHub repository (https://github.com/GetLevvel/message-simulator). If your organization would like assistance determining its RabbitMQ clustering strategy, please reach out to us at Levvel and we can get you started. If you do not have a cluster to run these simulations against, the previous post can help you get going with your own Docker RabbitMQ cluster: Testing RabbitMQ Clustering Using Docker – Part 1.
The next post will continue exploring High Availability simulations for the purposes of integration with a large, distributed framework utilizing a RabbitMQ cluster as a core component.
For the ambitious among you, I included an easter egg inside the Message Simulator’s repository: a ‘How To Guide’ for the framework I will be discussing in the next post. See if you can find it!
- Jay
Authored By
Jay Johnson