Why It’s Time to Test in Production (+ How to Do It Safely)

For the longest time, “testing in production” has been considered an absolute no-no. Developers would rather test in development and staging environments and deploy only when they know everything works perfectly.

But this approach is also holding your team back. Today’s software landscape is dramatically different:

  • You’re deploying code multiple times a day—not once a quarter
  • You’re using a microservices architecture, which means hundreds of dependencies
  • Your users expect continuous improvement BUT without service interruption
  • You’re piecing together third-party integrations/APIs that aren’t easy to replicate in a staging environment

Case in point: In 2021, we had our first API outage, which lasted 44 minutes. We did everything right. Tested the code in a staging environment. Used an identical tech stack. Got green checks all the way. Things still broke down, and the cause was a data discrepancy between the staging and live environments.

We learned our lesson back then, but this is still a problem for many engineering teams. That’s why you need to test in production with the proper guardrails in place.

In this article, we’ll explore how you can test in production and do it safely with feature flags.

Skip ahead to how to test in production with feature flags.

What is testing in production (and should you delete your staging environments to do this)?

Testing in production refers to the practice of validating software behaviour, performance, and functionality in the live environment. It’s the same environment where users interact with your application.

Many teams use three environments: development, staging, and production. In its strictest form, testing in production means deploying code straight to production without staging it first.

Honestly, this is a scary thing to do. Many developers have concerns such as:

  1. Risk of disruption: The fear that testing could cause downtime, data corruption, or degraded service for real users is entirely valid. No one wants to be responsible for breaking the customer experience.
  2. Data integrity concerns: Testing activities could potentially pollute production data—especially if it’s not isolated well.
  3. Increased pressure and accountability: When testing happens in production, the stakes are higher. Mistakes are visible to everyone—customers, management, and teammates.
  4. Replication challenges: Paradoxically, while teams fear testing code in production, many developers struggle with issues that only manifest in production environments and cannot be replicated elsewhere.
  5. Regulatory and compliance risks: In regulated industries, production testing could violate compliance requirements if you don’t manage it carefully.

Despite these risks, the reality is that testing in staging is not necessarily the safer option anymore.

“Traditional testing is becoming harder. You used to have one server, one database, one web service/application and possibly an API connection to a payment gateway,” explains Ben Rometsch, Flagsmith’s Co-Founder. “Applications now have a bunch of APIs they’re connected to, maybe 2-3 data stores they’re working on, 3-4 runtime services they’re running. This is great. They’re more capable, flexible, easy to develop, and powerful. But it means that the difficulty of getting a replica of that environment as closely as possible is increasing every day.”

Feature flags let you test in production safely, allowing you to control the visibility of features at runtime and make changes without redeploying code. You can:

  • Deploy code without immediately exposing it to users
  • Test features in production with only internal teams
  • Roll out features gradually to increasingly larger audiences
  • Turn off problematic features immediately without rolling back deployments

They decouple deployment from release. As a result, you get fine-grained control over who sees what and when. Even if you keep your staging environment (and there are good reasons to do so!), testing in production is still a good practice.
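To make the four capabilities above concrete, here is a minimal sketch of how a flag check might work. It is an in-memory illustration, not Flagsmith's SDK: the `Flag` class, the `bucket` and `is_enabled` helpers, and the `checkout_v2` flag are all hypothetical names. The hashing step is one common way to give each user a stable rollout decision.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    """Hypothetical in-memory flag definition (not a real SDK type)."""
    enabled: bool = False          # master switch; setting it to False acts as a kill switch
    internal_users: set = field(default_factory=set)  # see the feature whenever it is enabled
    rollout_percent: int = 0       # share of all other users who see it, 0-100


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) so the decision doesn't flip between requests."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, flag: Flag, user_id: str) -> bool:
    if not flag.enabled:
        return False               # turned off for everyone, no redeploy needed
    if user_id in flag.internal_users:
        return True                # internal testers always get the new path
    return bucket(flag_name, user_id) < flag.rollout_percent


# The code is deployed, but only the internal team can reach the new path.
checkout_v2 = Flag(enabled=True, internal_users={"qa-alice"}, rollout_percent=0)
print(is_enabled("checkout_v2", checkout_v2, "qa-alice"))    # internal tester
print(is_enabled("checkout_v2", checkout_v2, "customer-1"))  # regular user
```

Raising `rollout_percent` widens the audience without a deployment, and setting `enabled` to `False` hides the feature from everyone, including internal users.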

What are the differences between testing in production and traditional testing methods?

Previously, software testing followed a linear progression where you wrote the code, tested it, and deployed it in different environments.

But testing in a live environment doesn’t work this way. Here are the differences:

| Aspect | Traditional Testing | Testing in Production |
| --- | --- | --- |
| Environment fidelity | Attempts to create staging environments that mirror production as closely as possible | Acknowledges you can’t replicate live environments and tests in the actual production environment (with proper controls) |
| Feature exposure | Features are fully deployed to an environment or not at all | Uses feature flags for granular control, allowing deployment to production while controlling who sees new features and when |
| Risk management | Front-loads risk mitigation through extensive pre-production testing | Distributes risk management across the entire lifecycle and focuses on detecting problems faster |
| Testing scope | Focuses on functional correctness and predetermined test cases in controlled environments | Looks at the larger picture, including real-world performance, actual user behaviour, and production-scale edge cases |
| Time to market | Each environment represents a gate, potentially adding days or weeks to release cycles | Accelerates delivery by deploying to production behind feature flags, letting you validate during development |
| User behaviour | Simulates user behaviour through predetermined test cases | Exposes features to real users interacting naturally with the system |
| Third-party integrations | Attempts to mock or simulate external dependencies | Tests against actual third-party services with real behaviours and limitations |
| Infrastructure | Creates separate environments with similar but not identical infrastructure | Uses the exact production infrastructure, eliminating configuration discrepancies |
| Recovery approach | Relies on catching issues before production; recovery requires new deployments through the pipeline | Enables immediate mitigation via feature flags without requiring new deployments |

What are the benefits of testing in production with feature flags?

Instead of viewing production as the finish line where testing ends, you need to view it as a critical part of your testing strategy. Here are a few reasons why:

You can test with real user conditions and data volumes

Production environments are messy. Your application has to handle a huge range of user behaviours, edge cases, and data patterns. However closely a staging environment mirrors production, it’ll never fully represent this complexity.

That’s why you need to test in production. Your team can uncover edge cases and rare conditions that a staging environment will never surface. Similarly, the sheer data volume and velocity will bring up issues you hadn’t considered before. This approach lets you monitor application behaviour under real conditions while controlling the final user impact.

Your tests will be more accurate and reliable

No matter how much you try to mimic your production environment, there’s a good chance you’ll miss something. When you test in production, you test code in real network conditions, with an actual system load, and with authentic user behaviour patterns.

When you see an issue, you know it’s a real one that would’ve affected your users anyway. So, if you’re using a feature flag to control that test, say by only exposing a feature to a small subset of users via a phased rollout, you can contain the impact by toggling it off. As a result, your team collects more real-world data about the change and its impact.

You’ll iterate faster and reduce your time to market

Typically, developers pass code through multiple environments before deploying it to users. This means you’re spending too much time in each release cycle. But if you test in production, you’ll shorten this feedback cycle.

You can deploy new code into production but keep it invisible to your users while the testing phase is ongoing. Once you’ve validated the code, roll it out to a small segment of users, test the behaviour, and roll it out completely.

Feedback loops with feature flags (Source)

Instead of testing, fixing, testing, fixing, and then deploying, you’re doing it simultaneously. As a result, you can reduce your time to market by delivering value to users faster.

You can take advantage of trunk-based development

Software testing in production with feature flags naturally complements trunk-based development. You don’t have to create long-lived feature branches that result in complex merges. Instead, you can work on small, incremental changes that integrate into the main branch frequently.

How trunk-based development changes the production process (Source)

Also, feature flags keep incomplete work hidden from users while allowing it to be deployed to production. This approach reduces merge conflicts, encourages smaller code changes, and prevents the “integration hell” from long-running feature branches. Your teams can collaborate without stepping on each other’s work while controlling which features users access.

You can reduce mean time to recovery (MTTR)

Production incidents happen all the time. If you’re testing in a production environment with feature flags, you can identify incidents faster and nip them in the bud.

Feature flags give you an immediate rollback option should you need it. You don’t have to run a full rollback, which could involve multiple changes. All you have to do is turn off the problematic feature, which shortens the time to recovery.

You can also use a more sophisticated emergency response strategy. For instance, you might initially disable a problematic feature for all users. You can then gradually re-enable it for internal testing, then small user segments, and finally, the entire user base once the issue is resolved.

How can engineering teams use testing in production to prevent disaster?

Proper production testing can prevent catastrophic failures. The 2024 CrowdStrike outage shows what’s at stake. A faulty update to its Falcon sensor crashed Windows systems worldwide. This single deployment led to global IT outages affecting airlines, banks, healthcare systems, and critical infrastructure.

If the CrowdStrike team had rolled out the update gradually, for example behind feature flags, they could have identified the problem before it affected their entire customer base.

The timeline of the 2024 CrowdStrike outage (Source)

The question is: how do you avoid a similar incident? Here are a few ways to do that:

1. Try to minimise negative impact on end users

Testing in production is often conflated with “We’re going to expose all our features without quality checks.” But that’s not true. You can use the following strategies to limit the impact on users:

Use progressive delivery with feature flags 

Gradually roll out features to a small segment, test, validate, and then roll them out fully. There are several ways in which you can do that:

  • Phased rollouts: Allow teams to expose new features to larger segments of users gradually. You can limit the “blast radius” of any issue before it becomes a revenue-draining problem.
  • Canary deployments: Take a more targeted approach by directing a small percentage of traffic (often 1-5%) to the new version while routing the majority to the stable version. This creates an early warning system that quickly identifies issues before they impact most users.
  • A/B and multivariate testing: Goes beyond basic on/off toggles to compare multiple implementations of a feature with real users. You can pick the winning implementation that’ll remain in the live environment.
  • Testing in production with synthetic users: Create automated scripts that simulate real user behaviour. You still use feature flags to test the feature, but you remove the risk of rolling it out to paying customers.
  • Shadow testing: Sometimes called “dark launching”, this technique processes production traffic through new code paths without returning the results to users. The system compares the responses of the old and new implementations to identify any differences.
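Shadow testing is the least familiar technique in the list above, so a small sketch may help. The example below is illustrative only: `shadow_call`, `price_v1`, and `price_v2` are hypothetical names, and it assumes the candidate code path has no side effects (it doesn't write to a database or charge a card). The user always receives the stable result, and any difference or crash in the new path is recorded for later review.

```python
import logging
from typing import Callable

logger = logging.getLogger("shadow")


def shadow_call(stable: Callable[[dict], dict],
                candidate: Callable[[dict], dict],
                request: dict,
                mismatches: list) -> dict:
    """Serve the stable result; run the candidate on the side and record any differences."""
    expected = stable(request)
    try:
        actual = candidate(request)
        if actual != expected:
            mismatches.append({"request": request, "expected": expected, "actual": actual})
    except Exception as exc:  # a failing candidate must never affect the user
        logger.warning("candidate raised %r", exc)
        mismatches.append({"request": request, "expected": expected, "error": repr(exc)})
    return expected


# Hypothetical old and new pricing implementations.
def price_v1(req: dict) -> dict:
    return {"total": req["qty"] * req["unit_price"]}


def price_v2(req: dict) -> dict:
    # Differs on purpose: applies a bulk discount that v1 doesn't know about.
    discount = 0.9 if req["qty"] >= 10 else 1.0
    return {"total": req["qty"] * req["unit_price"] * discount}


mismatches: list = []
shadow_call(price_v1, price_v2, {"qty": 2, "unit_price": 5}, mismatches)   # both agree
shadow_call(price_v1, price_v2, {"qty": 10, "unit_price": 5}, mismatches)  # recorded as a mismatch
```

In a real service you would probably run the candidate asynchronously so it can't add latency to the user's request.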

2. Use monitoring and alerting mechanisms to detect issues in real-time

The ultimate goal of any kind of testing is to remove as much risk as possible. Even then, you can’t achieve 100% risk mitigation. That’s why you need to add a layer of continuous monitoring to detect issues when they happen. You can do this by:

  • Tracking response time, error rates, and resource utilisation to spot performance degradation.
  • Measuring how users interact with your application through session recordings or real user monitoring (RUM).
  • Using tracing mechanisms to follow requests across microservices to pinpoint root causes.
  • Implementing an observability platform like Grafana to identify unusual patterns that could lead to service disruptions.

If you have the right alerting thresholds and escalation paths, your team can jump in and turn off flags as needed.
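As a rough illustration of an alerting threshold wired to a flag, the sketch below tracks the error rate over a sliding window of recent requests and switches the flag off once the rate crosses a limit. The `ErrorRateGuard` class, the 5% threshold, and the window sizes are assumptions chosen for the example. In practice, the final step would call your flag service and notify whoever is on call.

```python
from collections import deque


class ErrorRateGuard:
    """Turns a flag off when the recent error rate crosses a threshold (hypothetical helper)."""

    def __init__(self, threshold: float = 0.05, window: int = 100, min_samples: int = 20):
        self.threshold = threshold
        self.min_samples = min_samples
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.flag_enabled = True

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)
        if len(self.outcomes) < self.min_samples:
            return  # too little data to judge
        error_rate = sum(self.outcomes) / len(self.outcomes)
        if error_rate > self.threshold:
            # Stand-in for "call the flag service and page the on-call engineer".
            self.flag_enabled = False


guard = ErrorRateGuard()
for _ in range(20):
    guard.record(failed=False)   # healthy traffic
guard.record(failed=True)
guard.record(failed=True)        # 2 failures in 22 requests exceeds 5%
print(guard.flag_enabled)
```

The guard deliberately never re-enables the flag on its own. Turning a feature back on after an incident is a decision for a person.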

3. Ensure data privacy and security during testing

Even when you’re testing in production, you can’t risk the confidentiality or security of your own (and customers’) data. Feature flags give you all the control you need but could lead to many issues if you don’t have the right security measures.

This is why engineering teams use role-based access control (RBAC) to restrict who can create, modify, and toggle the flags. You can add specific user roles and permissions for each role to ensure only the right stakeholders have access.

Other than that, maintain detailed logs of all your feature flag changes. If you’re troubleshooting a problem, you can see what changes were made and work your way from there.
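Here is a minimal sketch of how role checks and change logs fit together. The roles, the `PERMISSIONS` mapping, and the `FlagStore` class are hypothetical; a hosted flag platform would normally provide both features for you. The point is that every successful change records who made it, when, and what the previous value was.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical role-to-permission mapping.
PERMISSIONS = {
    "admin": {"create", "toggle", "delete"},
    "developer": {"create", "toggle"},
    "viewer": set(),
}


@dataclass
class FlagStore:
    flags: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def toggle(self, user: str, role: str, flag_name: str, enabled: bool) -> None:
        if "toggle" not in PERMISSIONS.get(role, set()):
            raise PermissionError(f"{user} ({role}) may not toggle {flag_name}")
        previous = self.flags.get(flag_name)  # None if the flag did not exist yet
        self.flags[flag_name] = enabled
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "flag": flag_name,
            "from": previous,
            "to": enabled,
        })


store = FlagStore()
store.toggle("dana", "developer", "new_search", True)   # allowed and logged
try:
    store.toggle("sam", "viewer", "new_search", False)  # rejected, nothing changes
except PermissionError as err:
    print(err)
```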

How to test in production using feature flags

The first thing you need to do is set up a feature flag. Here’s how you can do that in Flagsmith.

Next, think about the lifecycle of the feature flag. Remove flags once testing is complete unless they serve as kill switches or long-lived flags. As a result, you’ll avoid any technical debt that arises from these flags. And you won’t have to deal with a rogue flag that somebody turns on accidentally.

When you have a process down for testing, automate the flag controls. You can do that by:

  • Generating feature flags when you create new branches
  • Tying the flag state to specific deployments
  • Triggering tests based on specific criteria and rolling them out to test users
  • Automatically increasing feature exposure when your tests reach a performance threshold
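The last item in this list can be sketched as a small decision function. The step schedule, the metric names, and the default limits below are assumptions for illustration. A scheduler or CI job could call this function periodically and push the result to your flag service.

```python
ROLLOUT_STEPS = [1, 5, 25, 50, 100]  # percent of users; hypothetical schedule


def next_rollout_percent(current: int, error_rate: float, p95_ms: float,
                         max_error_rate: float = 0.01, max_p95_ms: float = 200.0) -> int:
    """Return the next exposure level: advance one step when healthy, drop to 0 otherwise."""
    healthy = error_rate <= max_error_rate and p95_ms < max_p95_ms
    if not healthy:
        return 0  # stop exposure and let a human investigate
    for step in ROLLOUT_STEPS:
        if step > current:
            return step
    return current  # already at full exposure


print(next_rollout_percent(current=5, error_rate=0.002, p95_ms=150.0))   # healthy: advance
print(next_rollout_percent(current=25, error_rate=0.05, p95_ms=150.0))   # unhealthy: pull back
```

Dropping straight to 0 on any breach is a conservative choice. Some teams prefer to step back one level instead.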

Using this approach, you’ll turn a seemingly risky strategy into a competitive advantage.

What are the best practices for testing in production environments?

Now that you know how to test in production, let’s look at how you can make the most of it:

1. Establish clear testing objectives and success criteria

Define your testing objectives before you implement this process. Document what you’re testing and why. It’ll give you some guidance on how to approach the testing process and what to measure. For example, performance tests focus on response times, while feature validation focuses on user completion rates.

Also, make sure your success criteria (and failure conditions) are specific and measurable.

NOT: “This feature should help the user achieve XYZ goal”.

BUT: “API response times remain under 200ms at 95th percentile”.

Specific criteria remove ambiguity when you’re deciding whether it’s time to release the feature to the entire user base.
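A measurable criterion like the one above can be checked automatically. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions, and the function names are hypothetical. Your monitoring tool may calculate percentiles slightly differently.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_criterion(latencies_ms: list, limit_ms: float = 200.0) -> bool:
    """True when the 95th percentile response time stays under the limit."""
    return percentile(latencies_ms, 95) < limit_ms


# 95 fast requests and 5 slow ones: the 95th percentile is still a fast request.
print(meets_latency_criterion([100.0] * 95 + [500.0] * 5))
# One more slow request pushes the 95th percentile over the limit.
print(meets_latency_criterion([100.0] * 94 + [500.0] * 6))
```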

2. Implement version control and rollback mechanisms

Always version your feature flags and code to maintain alignment between flag configurations and the code they control. So, treat the flag definition as code and store it in your version control system.

Also, use the toggles to roll back features when needed. You can automate rollbacks based on monitoring alerts, or implement a “break glass” procedure for emergencies. A break-glass procedure covers:

  • Who has the authority to trigger emergency disablement (RBAC)
  • How to disable the feature (including direct database access if necessary)
  • Communication templates for notifying stakeholders
  • Post-incident analysis procedures

Together, version control and rollback mechanisms make root cause analysis faster.

3. Collaborate with cross-functional teams and stakeholders

Before and during the testing process, get alignment on who can access and make changes to the code/flags. Typically, the following teams are involved:

  • Developers: Responsible for implementing feature flags
  • Quality assurance: Responsible for validation testing
  • Operations: Responsible for monitoring production metrics
  • Product: Consulted on rollout decisions

So, build operating procedures around this and implement access controls accordingly. Make sure you have regular touchpoints like daily standups or status checks during active testing periods. Then, you don’t have to worry about missing any observations or concerns and can adjust testing plans as needed.

Deploy features with confidence by testing in production

Ultimately, no staging environment can perfectly replicate production conditions, even if you sculpt it yourself.

If you want to mitigate these challenges, consider switching to testing in production—with feature flags. You’ll be able to decouple deployment from release and control the release of every feature without hesitation.

The future of software quality doesn’t lie in a simulated environment. It lies in controlled testing in an environment that truly matters—your production environment.
