Developing in Production

The Tweet

The impetus for this blog post came from @copyconstruct:

As an industry we haven't figured out how to enable a good developer experience for building distributed systems.

But if one thing is clear, it's that spinning up a mini version of the *entire* production architecture on a local laptop for development is *not* the solution.
— Cindy Sridharan (@copyconstruct) June 30, 2018

She's right. Developers are in the dark. We can't keep the system on a laptop. But keeping it on a beefy workstation and using localstack to mock out the cloud API doesn't capture the details – the way that Kinesis Data Firehose will slop buffered records into S3 buckets with the wrong timestamp, or the complexities of lambda pricing and data transfer costs. For some reason giving everyone their own personal AWS environment doesn't work. Even keeping a single staging environment doesn't seem to be cutting it.

After thinking about it for a good long while, I wrote this:

We shouldn’t be trying to make small copies of an environment fit on a laptop so we can debug it. We should be shipping code to production and then debugging it.
— Will Sargent (@will_sargent) January 16, 2020

Here's why I think we've reached a tipping point where development and debugging in production can make sense.

The Complexity Case

Complex systems have emergent behavior, producing epiphenomena that only appear with sufficient scale. A sufficiently large firestorm produces wind systems. A sufficiently large hangar produces clouds and rainfall. Complex systems can involve multiple feedback loops and interactions that can cause unexpected outcomes and misbehaviors that cannot be explained.

In a sufficiently complex distributed system – "Something as a Service", running in the cloud – there is enough going on that it's a struggle to replicate a smaller scale staging or QA environment, because the scale of the environment is a factor in its complexity. Even if you understand every single piece of the system in isolation, that doesn't mean you understand the entire system.

Adrian Colyer touches on this in STELLA: report from the SNAFU-catchers workshop on coping with complexity:

The takeaway point here is perhaps that even though the component being worked on is in itself no more complex than any of the existing ones in the system, and takes no more effort to develop and to understand in isolation than any of the existing components, its cost in terms of your overall complexity budget is higher than all those that preceded it.

This comes down to Woods' Theorem:

As the complexity of a system increases, the accuracy of any single agent’s own model of that system decreases rapidly.

If the emergent behavior of the production environment results in a system that is too complex to understand, then modelling a duplicate environment that is "simpler" is futile, because it won't be accurate. Human beings are by and large absolutely awful at modelling complex systems.

Even if you did completely understand the production system, it still wouldn't help. To show you have an accurate understanding of the system, you need to show that the pre-production system has the same emergent behavior. But emergent behavior is in part a function of scale. A duplicate environment created on a smaller scale will have different emergent behavior: a system with three nodes behaves differently than a system with three hundred. A test in staging does not show that it works in production: it proves only that it would work if production worked at the same scale as staging.

Scale and complexity cannot be ignored, because that's where the edge cases happen. And edge cases lurk everywhere.

There's a saying in medical school for new students: "When you hear hoofbeats, think of horses not zebras." But in a sufficiently complex system, zebras exist. And they're everywhere, on every level, even in the tools that supposedly report that systems are scaling appropriately. Microbenchmarking suffers from dynamic optimization. Load testing suffers from coordinated omission. Profilers suffer from safepoint bias. Garbage collection blocks on disk writes. Servers suffer unreported gray failure. There are zebras all the way down to the disk spindle – hardware fails, networks partition, and the error handling code is the critical path.

The Drag Factor

There's another problem with pre-production environments: they are infamous for being out of date.

This is odd on the face of it. We aren't lacking for tools, and we've done amazing things as an industry. We've gone from three-tier architecture to N-tier architecture to microservices and service meshes and whatnot. But even with everything – even with AWS, Terraform, Atlantis, Docker, Kubernetes, and the ecosystem surrounding it – it's not like we can wave a hand and create a new environment. In fact, it's harder.

My intuition is that there's a paradox of tooling: the more tools and code that you add to create elements in a system, the harder it is to replicate an environment encompassing those tools and code.

Not only do I think that tooling works against environment independence, but I also think there's a brute mechanical limitation outside of tooling. In programming, we look for DRY code: standing for "Don't Repeat Yourself." This is because copy/pasting the same code in several places requires remembering every single place that the same code is used. It's very easy to forget a detail in a particular method, and it's even easier when you weren't the one who wrote the code in the first place. The more you repeat yourself, the harder it is to keep short term memory together and remember the details. I think something similar is at work in maintaining a production environment. By itself, maintaining the production environment involves details, lots of them.

Every environment that you maintain above and beyond Terraform or Chef is a drag on operational efficiency. It's copy and paste, it's having to keep things together, it's fixing things when they break, applying patches and upgrades and running migration scripts. Meanwhile, the production environment is sophisticated enough and under enough constant observation that it becomes more attractive to do things in production than set up an environment that will be a huge drain on productivity and won't work anyway.

Tyler Treat touches on this in More environments will not make things easier:

“We need to run everything with this particular configuration to test this, and if anyone so much as sneezes my service becomes unstable.” Good luck with that. I’ve got a dirty little secret: if you’re not disciplined, no amount of environments will make things easier.

This is why giving everyone their own personal AWS environment doesn't work: it just increases the drag factor. Instead of a single staging environment which is not up to date and contains a trivial set of sample data, you have many environments which are all out of date in different ways, and every developer has to maintain their own.

As Charity Majors says, “Staging is a black hole for engineering time.” You can see the full talk from QCon 2018 – go to about the 13 minute mark for the staging discussion.

This is how we end up with testing in production.

Testing in Production

Testing in Production consists of several stages, each stage reversible and under close observation. At the core of testing in production is the idea of splitting deployments (of artifacts) from releases (of features).

Publish artifacts containing your new feature (using a new feature flag) to your artifact repository.
Deploy artifacts to production. Because the new feature is behind a feature flag, it will be disabled. Do not batch up multiple features in a deployment. Ship one change at a time.
If the rollout results in things being on fire, roll back to the previous published artifact.
Once you have verified that the rollout is complete and nothing is on fire, you can test the feature in production.
Begin by releasing the feature on one server, using the feature flag system. See what that server does.
If it works fine, release it on more servers. If it doesn't, roll it back.
If it's all good, release on all servers.
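
The release half of those steps can be sketched in a few lines. This is a minimal illustration, not any particular feature flag product: the FeatureFlag class, its method names, and the server ids are all hypothetical, and a real system would store flag state somewhere shared rather than in memory.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    """Hypothetical flag: deployed dark, then released server by server."""
    name: str
    enabled_servers: set = field(default_factory=set)  # explicit opt-ins
    rollout_percent: int = 0  # 0 = dark, 100 = released everywhere

    def _bucket(self, server_id: str) -> int:
        # Stable hash, so a server stays in the same bucket on every check.
        digest = hashlib.sha256(f"{self.name}:{server_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_enabled(self, server_id: str) -> bool:
        if server_id in self.enabled_servers:
            return True
        return self._bucket(server_id) < self.rollout_percent

    def release_on(self, server_id: str) -> None:
        """Release the feature on one server and see what that server does."""
        self.enabled_servers.add(server_id)

    def release_percent(self, percent: int) -> None:
        """Widen the release to roughly `percent` of servers."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.rollout_percent = percent

    def roll_back(self) -> None:
        """Turn the feature off everywhere without redeploying the artifact."""
        self.enabled_servers.clear()
        self.rollout_percent = 0
```

The point of the sketch is that release and rollback are state changes on the flag, not deployments: the artifact stays where it is.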

Needless to say, Charity Majors covers all of this in testing in production as a superpower. Cindy Sridharan goes into more detail in Testing in Production the Safe Way and follows that up with Testing in Production - the Hard Parts. And finally Kristian Köhntopp talks about doing this for over a decade at booking.com with Rolling back a rollout.

Note specifically that there are some requirements here for testing in production. You have good CI/CD practices. You have feature flags. You're capable of chaos engineering. You have developers looking at logs in production and verifying that a feature works as intended – there is no "glass castle" where production is considered to be too delicate to be touched.

However, this is production, and it's treated as production. Testing in production is a proving ground: there are experiments asking a question, with carefully measured results. If something goes wrong you'll get paged, and rule #0 is still Do Not Be On Fire.

As a developer, this is no fun at all.

A Playground with Guard Rails

The ideal developer experience progresses from "runs in seconds" unit tests to "runs in minutes" integration tests to "runs overnight" load tests in a production-like environment, before moving to a canary rollout and production experiments. Developers also need the safety of being able to write buggy code and test edge cases for a single component in an isolated production-like environment without worrying about feature flags, running through a full CI deployment process, having to reserve access to a shared system, or being paged.

Most importantly, developers need to be able to lean into the complexity of the system, so that the behavior of an individual component can be accurately modeled in production, while still being isolated.

Beau Lyddon makes the same point in What is Happening: Attempting to Understand Our Systems, saying "Provide the ability to experiment and test in production" around the 34:15 mark:

The video is excellent and worth watching in its entirety, but there is a TL;DR slide:

So. You've only got one environment: production. You're going to need a space in it for developers where they can play around with individual components of the system and not fuck things up for everyone else.

The Garden Hose

The first thing you need is production traffic. Not a lot of it: you don't need a firehose. Instead, you need a garden hose of traffic, just enough to be fun and interesting without being overwhelming.

Testing in production can use traffic shadowing, but shadowing by itself typically doesn't target or limit traffic. Shadowing is incredibly useful on its own terms – you can keep a buffer of all requests over the last 10 minutes and if there's any kind of hiccup, you can save it off automatically – but what you really want as a developer is something that can sniff, mask, store and replay traffic as it comes in, keeping the interesting parts.

That's where GoReplay comes in. Using GoReplay Monitoring, you can create a dark traffic system that can be used later. You can modify requests and mark them as playground data using Middleware. You can ensure requests are played in the correct order. And with the Pro version, you can replay gRPC and Protocol Buffers.

Yep, only skimmed through article, but we used similar techniques at work - teeing prod traffic to a dev instance with tons of debug logging turned on (prohibitive in prod) and replaying traffic at different speeds to reproduce really strange anomalies ...
— Cindy Sridharan (@copyconstruct) January 27, 2018
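
To make the "sniff, mask, store and replay" idea concrete, here is a rough sketch of a garden hose in plain Python. It is not GoReplay's API or middleware protocol. The Request shape, the X-Playground header, the list of sensitive headers, and the GardenHose class are all assumptions made for illustration.

```python
import hashlib
from collections import deque
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical captured request; real captures carry far more detail."""
    request_id: str
    path: str
    headers: dict
    received_at: float  # seconds since some epoch


SENSITIVE_HEADERS = {"authorization", "cookie"}  # assumed list, adjust to taste


def sampled(request_id: str, fraction: float) -> bool:
    """Deterministically keep roughly `fraction` of requests."""
    digest = hashlib.sha256(request_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") < fraction * 2**64


def mask(request: Request) -> Request:
    """Return a copy with secrets hidden and a playground marker added."""
    headers = {
        key: ("***" if key.lower() in SENSITIVE_HEADERS else value)
        for key, value in request.headers.items()
    }
    headers["X-Playground"] = "true"  # hypothetical marker header
    return replace(request, headers=headers)


class GardenHose:
    """Keeps a sampled, masked window of recent traffic for replay."""

    def __init__(self, fraction: float, window_seconds: float = 600.0):
        self.fraction = fraction
        self.window_seconds = window_seconds
        self.buffer = deque()

    def offer(self, request: Request) -> Optional[Request]:
        if not sampled(request.request_id, self.fraction):
            return None  # most traffic never reaches the playground
        masked = mask(request)
        self.buffer.append(masked)
        # Drop anything older than the window, measured from this request.
        while self.buffer and (
            request.received_at - self.buffer[0].received_at > self.window_seconds
        ):
            self.buffer.popleft()
        return masked

    def replay(self) -> list:
        """Return buffered requests in the order they were received."""
        return sorted(self.buffer, key=lambda r: r.received_at)
```

Hashing the request id rather than rolling a random number means the same request is always either in or out of the sample, which helps when you want to reproduce an anomaly.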

There's also an AWS Lambda version called ShadowReader; I'm less familiar with that but it can show memory leaks from replays which is what you want from a replay system.

There's also Envoy Service Tap Filter, which looks interesting but is still not built for general purpose middleware logic – it doesn't do fractional sampling or dynamic configuration, for example. Envoy also has a traffic shifting filter which does shadowing using the "request_mirror_policy" policy. This is less flexible than the service tap, but has been around longer and is useful for high volume services.

Once you've got that source of data, you need to send it somewhere.

The Playground Sandbox

The playground sandbox (for brevity I'll just call it the playground here) is a decently sized CPU instance that is provisioned and deployed by a continuous delivery service on demand when the developer pushes a commit to a specific repository. There is limited read access to specific systems in production, and there is absolutely no write access or feedback loop into the production system as a whole. There are packet filters in place so that if there are external URLs and callbacks outside the system, the playground cannot access them.

It doesn't go through any kind of continuous integration testing, and there's no artifact versioning going on. It runs an individual service, and gets traffic from the garden hose. The developer can save off and replay streams of data from the playground so that interesting behavior can be repeatable and reproducible.

The playground should not be bulkheaded – for accurate reproducibility, it needs to be as identical to a live service as possible, and that means sharing the same resources, with the same latency and load.

The playground has tags on it saying who it belongs to, when it was built, and when it's going away. A lambda function rolls through every so often and kills playgrounds that weren't shut down in time. Cost estimates and budgeting are worked into the system using DCE to ensure there are no cost overruns or orphaned resources.
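
The reaper logic itself is small. The sketch below leaves out the cloud API entirely: the Playground record, its field names, and the terminate callback are hypothetical stand-ins for whatever tags and termination call your provider gives you.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Playground:
    """Hypothetical view of the tags on a playground instance."""
    instance_id: str
    owner: str
    built_at: datetime
    expires_at: datetime


def expired(playgrounds, now):
    """Playgrounds whose expiry tag is at or before `now`."""
    return [p for p in playgrounds if p.expires_at <= now]


def reap(playgrounds, now, terminate):
    """Terminate every expired playground and return the ids that were killed.

    `terminate` is an assumed callback wrapping the real termination call.
    """
    doomed = expired(playgrounds, now)
    for playground in doomed:
        terminate(playground.instance_id)
    return [p.instance_id for p in doomed]
```

Run on a schedule, with `now` passed in rather than read from the clock, this stays easy to test and easy to dry-run before it is allowed to kill anything.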

In addition to strict role-based access control set up with the principle of least authority, a kill switch can shut down all the playgrounds immediately in the event of excessive or pathological queries, or just because production doesn't have the resources right now.

As such, the playground doesn't have to be especially powerful. It doesn't have to be persistent. It just has to be reachable and observable by the developer. There is the ability to make it arbitrarily large and powerful and draw on more resources from the system, but that happens only by request.

James Ward points out that this is what Salesforce does with a range of developer sandboxes, ranging from metadata-only sandboxes, to sandboxes that have full replicas of the production database and can take days to set up.

I am aware of Kubernetes development tools such as Tilt, Draft, Skaffold and Garden. I don't know much about them, but my feeling is that they don't address the problem of reproducing behavior in a system with emergent behavior at scale. Telepresence looks like a better fit, but has the issue of sending data over the wire to a local laptop, which could be expensive if you don't have Direct Connect set up, and can potentially introduce timing and latency issues connecting to the local laptop.

State and Interactions

Keeping state together and interacting with services that aren't in the playground is the tricky bit. Christian Posta covers some of the "gotchas" of running a playground in Shadowing Patterns for Microservices. There's a lot that he covers, but the options presented are:

Service Virtualization – using a fake service endpoint for interactions. This is different from using a simple stub, and there are many companies out there that sell service virtualization products.
Synthetic Transactions – using real services but applying a tag that means the transaction is rolled back.
Datastore Virtualization – using a virtualized layer over the production database that allows for writes but does not propagate them outside the playground. There's a series of blog posts in more detail.
Datastore Materialization – using a full on database that picks up changes from the production database using change-data-capture. This looks really interesting, especially with new projects like DBLog coming out.

This is made much easier if you have a system that can draw a distinction between an event and persistence of that event. Posta goes into detail in The Hardest Part About Microservices: Your Data:

As we’ve been saying, for microservices we value autonomy. We value being able to make changes independent of other systems (in terms of availability, protocol, format, etc). This decoupling of time and any guarantees about anything between services in any bounded time allows us to truly achieve this sort of autonomy (this is not unique to computer systems… or any systems for that matter. So I say, between transaction boundaries and between bounded contexts, use events to communicate consistency. Events are immutable structures that capture an interesting point in time that should be broadcast to peers. Peers will listen to the events in which they’re interested and make decisions based on that data, store that data, store some derivative of that data, update their own data based on some decision made with that data, etc, etc.

This is also easier when you have a system that can work with events natively. I really like what Lagom does here, because it comes with an out of the box persistence model based on event sourcing and CQRS, and handling internal vs external communication. I am also interested in Cloudstate's abstraction over state using Serverless logic. I am an ex-Lightbend employee, so I'm familiar with this stack. If you're new to all of this, event sourcing needs careful thought, and you should read Implementing Domain-Driven Design and Reactive Messaging Patterns.

In general, you don't need to worry about masking personally identifiable information from the database, because the playground exists in production, and that data never leaves the instance. There are database proxies that can do anonymization and data masking if that's a concern.

Tenant-based architectures have also been mentioned as a solution. I tend to feel that tenancy is a security construct that's oriented to customers, and may not have the flexibility needed for developers, but if you are already multi-tenant then I can see how it would make isolation easier.

No End to End

The Playground only covers one service. This is deliberate. There is no end-to-end testing here. End-to-end development crossing over multiple systems is a recipe for non-deterministic behavior. It's a rathole.

Tyler Treat adds in More environments will not make things easier:

What are we to do then? With respect to development, get it out of your head that you can run a facsimile of production to build features against. If you need local development, the only sane and cost-effective option is to stub. Stub everything. If you have a consistent RPC layer — discipline — this shouldn’t be too difficult. You might even be able to generate portions of stubs.

So, no cross-service development. In a services based environment, your responsibility stops with the service endpoints, which you can stub out and define through consumer driven contracts.
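
As a rough illustration of how a consumer driven contract can double as a stub, here is a sketch in plain Python. It does not use any real contract testing library: the CONTRACT structure, the /users/42 endpoint, and both helper functions are made up for the example.

```python
# Hypothetical contract: the requests a consumer makes and the replies it expects.
CONTRACT = [
    {
        "request": {"method": "GET", "path": "/users/42"},
        "response": {"status": 200, "body": {"id": 42, "name": "Ada"}},
    },
]


def make_stub(contract):
    """Build a stub endpoint for the consumer to develop against."""
    table = {
        (item["request"]["method"], item["request"]["path"]): item["response"]
        for item in contract
    }

    def stub(method, path):
        # Anything outside the contract is treated as not found.
        return table.get((method, path), {"status": 404, "body": None})

    return stub


def verify_provider(contract, provider):
    """Replay the contract against the real provider; return failing requests."""
    failures = []
    for item in contract:
        request = item["request"]
        actual = provider(request["method"], request["path"])
        if actual != item["response"]:
            failures.append(request)
    return failures
```

The consumer team develops against `make_stub(CONTRACT)`, and the provider team runs `verify_provider` in its own build, so neither side needs the other's service running.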

Even if you're not stubbing and you are actually sending data out of your service, a service under development should have no downstream effect – anything that is externally visible should be flagged as a "tracer bullet" that is a visible no-op downstream. Beau Lyddon covers this in What is Happening around the 32:55 mark.
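
One way to picture a tracer bullet is from the receiving side. In this sketch the X-Tracer-Bullet header, the OrderService class, and its fields are all hypothetical; the only idea taken from the text is that a flagged request is observed and acknowledged but changes nothing.

```python
from dataclasses import dataclass, field

TRACER_HEADER = "X-Tracer-Bullet"  # hypothetical header name


@dataclass
class OrderService:
    """Hypothetical downstream service that honors tracer bullets."""
    orders: list = field(default_factory=list)      # real, persisted state
    tracer_log: list = field(default_factory=list)  # visible, but no side effects

    def handle(self, headers: dict, order: dict) -> dict:
        if headers.get(TRACER_HEADER) == "true":
            # Record that the request arrived so the developer can see it,
            # then stop: nothing is persisted and nothing is forwarded.
            self.tracer_log.append(order)
            return {"status": "accepted", "tracer": True}
        self.orders.append(order)
        return {"status": "accepted", "tracer": False}
```

This only works if every downstream service agrees on the flag, which is a discipline problem as much as a code problem.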

This is because you can't know what goes on outside your bounded context. You have no control over what happens before data comes in, or what happens after your data goes out. You don't even know if the data you send or receive is in the right order.

This means that when you're developing code for a service, the service itself is the boundary. As Kelvin Wahome writes in Microservice Testing: A New Dawn:

Domain logic often manifests as complex calculations and a collection of state transitions. Since these types of logic are highly state-based there is little value in trying to isolate the units. This means that as far as possible, real domain objects should be used for all collaborators of the unit under test; a unit of test.

Are there going to be larger scale feature interaction bugs? Yes. But feature interaction bugs are a sign of boundary error, not programming error. It won't be fixed until you can sit two different services teams down and work out the interactions between services through requirements gathering. That's not a fixable problem by itself, and isn't something you catch in development.

The Workspace

Developers need a space that can provide good security, good network access, and short deploy times to the service. We don't want it to be on the service itself, but we also don't want the developer to be bound to the physical limitations of their laptops.

What I think we can do instead is provide an Amazon Linux Workspace.

I think that remote instances have a number of advantages over developing on a laptop:

Developers are guaranteed the same experience no matter what machine they use.
IT and Security have greater control over keeping sensitive information secure.
Developers can scale up their remote instances when they need to do heavy compiling, while not overheating their laptops.
Much easier for devops to keep scripts and configuration consistent between developers.
Developers can set up multiple workspaces as necessary.

The workspace comes with IntelliJ IDEA, and can clone, work with and commit code. It has access to the playground, so it can tail logs and bounce services as necessary. It lets you run AWSume to manage session tokens and assume role credentials. It can even use socat and ssh to attach a debugger or profiler and connect a REPL. It's more persistent than the playground is, but will be auto-stopped when the developer isn't using it.

Why I Think This Works

I think this is a better overall solution for the following reasons:

Developers can get immediate feedback from running code in a playground.
Access to recent production data is not an issue, because it's mirrored in near real-time.
Ensuring the security and privacy is not an issue because it never leaves the production environment.
Emulating the production environment performance profile is trivial.
Playground instances are ephemeral and sandboxed, so they require less maintenance compared to a staging/dev environment.
Moving from unit tests to functional tests to playground to "testing in production" to full rollout is a gradual process of expanding out the environment.

Needless to say, this comes with the caveat that it's only a workable solution when you are already at a "test in production" level of operational awareness, and there can be other drivers that may lead you to set up environments for specific teams: I recommend Making Sense of Environments for the details there.

I know I've barely scratched the surface of this. As per one reaction:

Revolutions are usually messy, and this is a revolutionary point of view. Still, quite a few interesting points.
A couple tricks might make sense, adapted: e.g. don't spin up the whole system in dev or deploy dev code in prod, but insert your dev instance into the load balancer https://t.co/JrknBK0Z8V
— le Chep (@c_chep) January 17, 2020

So, this is messy. Revolutions are messy. But they're possible. And at some point they happen.
