Tracer Bullet Testing, also known as Synthetic Transactions or Synthetic Monitoring, is a way of testing your service/app in production, where it's supposed to run, but without affecting users, clients, or external systems. Oddly enough, it doesn't really feature on Martin Fowler's testing pyramid; I feel it certainly deserves a position up there, perhaps as a level of integration testing.
Such a testing strategy is necessary because your dev machine or CI server is at best a rough approximation of the production environment. Containers have reduced this variance to a certain extent, but ultimately they too run on commodity hardware, which fails all the time or behaves ever so slightly differently from your high-end development workstation.
Networks, databases, a differently configured firewall rule: any of these, and more, could throw a spanner in the works, and there is just no way to simulate these kinds of failures in a dev environment.
Unit and stub-driven behavioural testing, though essential, should only be a starting point. Once you have done all the usual stuff, testing in production should be the ultimate line of defence.
You have to be careful with testing in production, though. If most or all of your operations are read-only then you might be OK; otherwise you will need to do some thinking, and what that looks like depends on your domain.
We implemented Tracer Bullet Testing in my team about six months ago for our core warehouse restocking workflow, and here's what we did:
- We chose a unicorn product group, i.e. a product group that sells very little, is very seasonal, or is otherwise not used for actual restocking. We selected a whole product group because we restock by product group for efficiency reasons.
- We identified the critical path in our workflow: the flow that processed products, generated proposals, and automatically submitted purchase orders for restocking. This path involved around four services plus an additional service that publishes business metrics and KPIs.
- We then identified the exit points in each of these services, the points that published those messages and metrics to external services. Those external services are owned by other teams, and we didn't want to publish tracer bullet data to them and affect their workflows. The exit points were: the business metrics publisher that published to BigQuery, and the purchase order sender that sent user-accepted proposals to the purchase order service owned by another team.
- We then established an ambient context at the entry point of each of our services on the selected critical path. This ambient context lets us determine whether or not we are running in Tracer Bullet mode by looking at the product group id we are currently processing. If it's the unicorn product group, then we don't call services outside our domain and we don't publish any metrics (technical or business).
This ambient context was established using the AsyncLocal<T> type in .NET Core like so:
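The original snippet appeared as an image in the post; a minimal sketch of such a context, using a hypothetical `TracerBulletContext` wrapper (the class name is illustrative, not the team's actual code), might look like this:

```csharp
using System.Threading;

// Hypothetical wrapper around AsyncLocal<T>. A value set here flows across
// async/await boundaries within the same logical request, but concurrent
// requests each see their own independent value.
public static class TracerBulletContext
{
    private static readonly AsyncLocal<bool> _isActive = new AsyncLocal<bool>();

    // True when the current request is processing the unicorn product group.
    public static bool IsActive
    {
        get => _isActive.Value;
        set => _isActive.Value = value;
    }
}
```

Because `AsyncLocal<T>` is scoped to the logical call context rather than a physical thread, the flag survives thread hops during `await` without leaking between requests.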
We then looked at the POST request header (for REST services) or the message header (for services that used asynchronous message queueing), read the product group id from the payload and, if it was the target group, set the context to active:
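This snippet was also an image in the original post; a self-contained sketch of the entry-point check, with an illustrative unicorn product group id (a placeholder, not the real one), could be:

```csharp
using System.Threading;

public static class TracerBulletEntryPoint
{
    // Placeholder id; the actual unicorn product group id is not in the post.
    private const string UnicornProductGroupId = "PG-0000";

    private static readonly AsyncLocal<bool> _active = new AsyncLocal<bool>();
    public static bool IsActive => _active.Value;

    // Called once per request at the service's entry point, with the
    // product group id read from the REST payload or message header.
    public static void Establish(string productGroupId)
    {
        _active.Value = productGroupId == UnicornProductGroupId;
    }
}
```

Downstream code can then guard each exit point identified earlier with a check such as `if (TracerBulletEntryPoint.IsActive) return;` before publishing to external systems.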
This data would then be available throughout the scope of that request/use case flow.
- Running tests in production is useless if you don't find out when something fails. To address this, we added tag-based logging to our DataDog platform; DataDog then alerts us on Slack if any errors are recorded under that tag. It's a simple strategy but it works quite well.
- We then defined a schedule (9AM–9PM) which fell outside of the daily run window (7:30AM–8:30AM), and then we simply invoked our usual core workflow (via the same REST endpoint that's used in normal runs) every 30 minutes using a CRON-scheduled Lambda.
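For reference, an EventBridge-style cron expression that fires every 30 minutes between 9AM and 9PM might look like the following (illustrative; the post doesn't show the actual rule):

```
cron(0/30 9-20 * * ? *)
```

The fields are minutes, hours, day-of-month, month, day-of-week, and year; the last trigger here is 20:30, which keeps every run inside the 9AM–9PM window.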
This gave us ample coverage over time to uncover any problems in our critical path that might be introduced by any of the dozens of PRs we push to master every day, and it allowed us to take appropriate action to fix problems before they caused an outage.
The diagram above shows services A, B, C and D (inside the dotted box) that are internal to the domain whereas things outside the dotted box are external services and systems. Red arrows indicate the interactions that we want to prevent during the tracer bullet test runs.
The only downside at the moment is that it doesn't cover all the other paths in our landscape; for example, some of the user-initiated actions are yet to be covered by the Tracer Bullet tests, and we'll be working on them next. But as it is, this strategy has been quite useful for us: it has helped us trap regressions in production, and it has also been useful for tracking other production metrics such as latencies and errors.
You can also create appropriately detailed monitoring dashboards to track tracer bullet runs and related technical metrics, because chances are you will want to separate them from your normal business dashboards. We haven't done this yet, but we do publish a special boolean tag that lets us filter the metrics.
On the whole, this is a strategy definitely worth investing in if you are serious about the quality and reliability of your microservices.