Why It’s Time to Test in Production (+ How to Do It Safely)

Tanaaz Khan

For the longest time, “testing in production” has been considered an absolute no-no. Developers would rather test in development and staging environments and deploy only when they know everything works perfectly.

But this approach is also holding your team back. Today’s software landscape is dramatically different:

You’re deploying code multiple times a day—not once a quarter
You’re using a microservices architecture, which means hundreds of dependencies
Your users expect continuous improvement BUT without service interruption
You’re piecing together third-party integrations/APIs that aren’t easy to replicate in a staging environment

Case in point: In 2021, we had our first API outage, which lasted 44 minutes. We did everything right. Tested the code in a staging environment. Used an identical tech stack. Got green checks all the way. It came down to a data discrepancy between the staging and live environment, and things broke down.

We learned our lesson back then, but this is still a problem for many engineering teams. That’s why you need to test in production with the proper guardrails in place.

In this article, we’ll explore how you can test in production and do it safely with feature flags.

Skip ahead to how to test in production with feature flags.

What is testing in production (and should you delete your staging environments to do this)?

Testing in production refers to the practice of validating software behaviour, performance, and functionality in the live environment. It’s the same environment where users interact with your application.

Many teams use three environments: development, staging, and production. When you test in production, you deploy code to the live environment without validating it in staging first.

Honestly, this is a scary thing to do. Many developers have concerns such as:

Risk of disruption: The fear that testing could cause downtime, data corruption, or degraded service for real users is entirely valid. No one wants to be responsible for breaking the customer experience.
Data integrity concerns: Testing activities could potentially pollute production data—especially if it’s not isolated well.
Increased pressure and accountability: When testing happens in production, the stakes are higher. Mistakes are visible to everyone—customers, management, and teammates.
Replication challenges: Paradoxically, while teams fear testing code in production, many developers struggle with issues that only manifest in production environments and cannot be replicated elsewhere.
Regulatory and compliance risks: In regulated industries, production testing could violate compliance requirements if you don’t manage them carefully.

Despite these risks, the reality is that testing in staging is not necessarily the safer option anymore.

“Traditional testing is becoming harder. You used to have one server, one database, one web service/application and possibly an API connection to a payment gateway,” explains Ben Rometsch, Flagsmith’s Co-Founder. “Applications now have a bunch of APIs they’re connected to, maybe 2-3 data stores they’re working on, 3-4 runtime services they’re running. This is great. They’re more capable, flexible, easy to develop, and powerful. But it means that the difficulty of getting a replica of that environment as closely as possible is increasing every day.”

Feature flags let you test in production safely, allowing you to control the visibility of features at runtime and make changes without redeploying code. You can:

Deploy code without immediately exposing it to users
Test features in production with only internal teams
Roll out features gradually to increasingly larger audiences
Turn off problematic features immediately without rolling back deployments

They decouple deployment from release. As a result, you get fine-grained control over who sees what and when. Even if you keep your staging environment (and there are good reasons to do so!), testing in production is still a good practice.
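As a rough illustration of decoupling deployment from release, here is a minimal sketch of a runtime flag check. It uses a hypothetical in-memory flag store rather than any particular flag service, and the flag name, user IDs, and function names are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Flag:
    """Hypothetical in-memory flag; a real flag service would store this remotely."""
    enabled: bool = False  # master switch, doubles as a kill switch
    allowed_users: set[str] = field(default_factory=set)  # empty set = everyone


# Illustrative flag store: the new flow is on, but only for two internal users.
FLAGS = {"new_checkout": Flag(enabled=True, allowed_users={"qa-alice", "dev-bob"})}


def is_enabled(flag_name: str, user_id: str) -> bool:
    """Return True only if the flag is on and this user is allowed to see it."""
    flag = FLAGS.get(flag_name)
    if flag is None or not flag.enabled:
        return False  # unknown or switched-off flags fail closed
    return not flag.allowed_users or user_id in flag.allowed_users


def checkout(user_id: str) -> str:
    # Both code paths are deployed; the flag decides which one runs.
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "existing checkout flow"


print(checkout("qa-alice"))    # internal tester
print(checkout("customer-1"))  # regular user
```

Setting `enabled` to `False` sends every user back to the existing flow without a redeploy, which is the kill-switch behaviour described above.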

What are the differences between testing in production and traditional testing methods?

Previously, software testing followed a linear progression where you wrote the code, tested it, and deployed it in different environments.

But testing in a live environment doesn’t work this way. Here are the differences:

	Traditional Testing	Testing in Production
Environment fidelity	Attempts to create staging environments that mirror production as closely as possible	Acknowledges you can’t replicate live environments and tests in the actual production environment (with proper controls)
Feature exposure	Features are fully deployed to an environment or not at all	Uses feature flags for granular control, allowing deployment to production while controlling who sees new features and when
Risk management	Front-loads risk mitigation through extensive pre-production testing	Distributes risk management across the entire lifecycle and focuses on detecting problems faster
Testing scope	Focuses on functional correctness and predetermined test cases in controlled environments	Looks at the larger picture—including real-world performance, actual user behaviour, and production-scale edge cases
Time to market	Each environment represents a gate, potentially adding days or weeks to release cycles	Accelerates delivery by deploying to production behind feature flags, letting you validate during development
User behaviour	Simulates user behaviour through predetermined test cases	Exposes features to real users interacting naturally with the system
Third-party integrations	Attempts to mock or simulate external dependencies	Tests against actual third-party services with real behaviours and limitations
Infrastructure	Creates separate environments with similar but not identical infrastructure	Uses the exact production infrastructure, eliminating configuration discrepancies
Recovery approach	Relies on catching issues before production—recovery requires new deployments through the pipeline	Enables immediate mitigation via feature flags without requiring new deployments

What are the benefits of testing in production with feature flags?

Instead of viewing production as the finish line where testing ends, you need to view it as a critical part of your testing strategy. Here are a few reasons why:

You can test with real user conditions and data volumes

Production environments are messy. You’ll see a huge range of user behaviours, edge cases, and data patterns that the application handles. Even if you replicate it perfectly in a staging environment, it’ll never fully represent this complexity.

That’s why you need to test in production. Your team can uncover edge cases and rare conditions that a staging environment will never surface. Similarly, the sheer data volume and velocity will bring up issues you hadn’t considered before. This approach lets you monitor application behaviour under real conditions while controlling the final user impact.

Your tests will be more accurate and reliable

No matter how much you try to mimic your production environment, there’s a good chance you’ll miss something. When you test in production (TIP), you test code under real network conditions, with actual system load, and with authentic user behaviour patterns.

When you do see issues, you know they’re real issues that would have affected your users anyway. If a feature flag controls the test, say by exposing a feature to a small subset of users via a phased rollout, you can contain the impact by toggling it off. As a result, your team gathers real-world data about the change and its impact.

You’ll iterate faster and reduce your time to market

Typically, developers pass code through multiple environments before deploying it to users. This means you’re spending too much time in each release cycle. But if you test in production, you’ll shorten this feedback cycle.

You can deploy new code into production but keep it invisible to your users while the testing phase is ongoing. Once you’ve validated the code, roll it out to a small segment of users, test the behaviour, and roll it out completely.

Feedback loops with feature flags (Source)

Instead of testing, fixing, testing, fixing, and then deploying, you’re doing it simultaneously. As a result, you can reduce your time to market by delivering value to users faster.
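The "deploy hidden, expose a small segment, then roll out completely" flow described above is often implemented with deterministic bucketing. The sketch below is one common way to do it, not a description of how any specific tool works: each user is hashed into a stable bucket from 0 to 99, and the rollout percentage decides which buckets see the feature. The flag name and user IDs are hypothetical.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def in_rollout(flag_name: str, user_id: str, percentage: int) -> bool:
    """True if the user falls inside the first `percentage` buckets."""
    return bucket(flag_name, user_id) < percentage


# Simulate raising the exposure step by step for 1,000 made-up users.
users = [f"user-{i}" for i in range(1000)]
for pct in (0, 5, 50, 100):
    exposed = sum(in_rollout("new_search", u, pct) for u in users)
    print(f"{pct:>3}% target -> {exposed} of {len(users)} users exposed")
```

Because the bucket depends only on the flag name and user ID, a user who sees the feature at 5% keeps seeing it at 50%, so nobody flips back and forth as exposure grows.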

You can take advantage of trunk-based development

Software testing in production with feature flags naturally complements trunk-based development. You don’t have to create long-lived feature branches that result in complex merges. Instead, you can work on small, incremental changes that integrate into the main branch frequently.

How trunk-based development changes the production process (Source)

Also, feature flags keep incomplete work hidden from users while allowing it to be deployed to production. This approach reduces merge conflicts, encourages smaller code changes, and prevents the “integration hell” from long-running feature branches. Your teams can collaborate without stepping on each other’s work while controlling which features users access.

You can reduce mean time to recovery (MTTR)

Production incidents happen all the time. If you’re testing in a production environment with feature flags, you can identify incidents faster and nip them in the bud.

Feature flags give you an immediate rollback option should you need it. You don’t have to run a full rollback, which could involve multiple changes. You only turn off the problematic feature, which shortens the incident’s MTTR.

You can also use a more sophisticated emergency response strategy. For instance, you might initially disable a problematic feature for all users. You can then gradually re-enable it for internal testing, then small user segments, and finally, the entire user base once the issue is resolved.
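The staged recovery described above can be modelled as a small state machine. This is only a sketch under assumed names: the stages, the audience sets, and the `healthy` signal are hypothetical stand-ins for whatever your flag tool and monitoring provide.

```python
from enum import IntEnum


class Stage(IntEnum):
    OFF = 0        # disabled for everyone (incident response)
    INTERNAL = 1   # internal testers only
    SEGMENT = 2    # internal testers plus a small user segment
    EVERYONE = 3   # fully re-enabled


# Hypothetical audiences; in practice these would come from your flag tool.
INTERNAL_USERS = {"dev-bob", "qa-alice"}
BETA_SEGMENT = {"customer-7", "customer-42"}


def can_see(stage: Stage, user_id: str) -> bool:
    """Decide visibility for one user at the current recovery stage."""
    if stage is Stage.OFF:
        return False
    if stage is Stage.INTERNAL:
        return user_id in INTERNAL_USERS
    if stage is Stage.SEGMENT:
        return user_id in INTERNAL_USERS or user_id in BETA_SEGMENT
    return True


def next_stage(stage: Stage, healthy: bool) -> Stage:
    """Advance one step when the fix looks healthy; otherwise drop back to OFF."""
    if not healthy:
        return Stage.OFF
    return Stage(min(stage + 1, Stage.EVERYONE))
```

Dropping straight back to `OFF` on any unhealthy signal is a deliberately conservative choice for this sketch; a team could equally step back one stage at a time.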

How can engineering teams use testing in production to prevent disaster?

Proper production testing can prevent catastrophic failures. The 2024 CrowdStrike outage shows what’s at stake. A faulty update to its Falcon sensor crashed Windows systems worldwide. This single deployment led to global IT outages affecting airlines, banks, healthcare systems, and critical infrastructure.

If the CrowdStrike team had rolled out the update gradually, for example behind feature flags, they could have identified the problem before it affected their entire customer base.

The timeline of the 2024 CrowdStrike outage (Source)

The question is: how do you avoid a similar incident? Here are a few ways to do that:

1. Try to minimise negative impact on end users

Testing in production is often conflated with “We’re going to expose all our features without quality checks.” But that’s not true. You can use the following strategies to limit the risk to your users:

Use progressive delivery with feature flags

Gradually roll out features to a small segment, test, validate, and then roll them out fully. There are several ways in which you can do that:

Phased rollouts: Allow teams to expose new features to larger segments of users gradually. You can limit the “blast radius” of any issue before it becomes a revenue-draining problem.
Canary deployments: Take a more targeted approach by directing a small percentage of traffic (often 1-5%) to the new version while routing the majority to the stable version. This creates an early warning system that quickly identifies issues before they impact most users.

A/B and multivariate testing: Goes beyond basic on/off toggles to compare multiple implementations of a feature with real users. You can pick the winning implementation that’ll remain in the live environment.
Test in production with synthetic users: You can create automated scripts that simulate real user behaviour. In this case, you can still use feature flags to test the feature but remove the risk of rolling it out to paying customers.
Shadow testing techniques: Also called “dark launching”, these process production traffic through new code paths without returning the results to users. The system compares the old and new implementation responses to identify any differences.
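To make the shadow testing idea from the list above more concrete, here is a minimal sketch. The two pricing functions are invented for the example, and the candidate contains a deliberate bug so that the comparison has something to find. The user always receives the current implementation's result.

```python
import logging
from typing import Callable

logger = logging.getLogger("shadow")


def shadow_call(current: Callable[[int], int],
                candidate: Callable[[int], int],
                request: int,
                mismatches: list) -> int:
    """Serve the current result; run the candidate only to compare."""
    result = current(request)
    try:
        shadow_result = candidate(request)
        if shadow_result != result:
            mismatches.append((request, result, shadow_result))
    except Exception:
        # A failing candidate must never affect the user-facing response.
        logger.exception("candidate failed for request %r", request)
        mismatches.append((request, result, "error"))
    return result


# Hypothetical implementations: the candidate mishandles negative input.
def price_v1(quantity: int) -> int:
    return max(quantity, 0) * 10


def price_v2(quantity: int) -> int:
    return quantity * 10


mismatches: list = []
responses = [shadow_call(price_v1, price_v2, q, mismatches) for q in (1, 2, -3)]
print(responses)   # what users received
print(mismatches)  # differences recorded for the team to review
```

This sketch runs the candidate inline for simplicity. In a real system you would more likely run it asynchronously or on mirrored traffic so it adds no latency, and make sure it cannot write to production data.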

2. Use monitoring and alerting mechanisms to detect issues in real-time

The ultimate goal of any kind of testing is to remove as much risk as possible. Even then, you can’t achieve 100% risk mitigation. That’s why you need to add a layer of continuous monitoring to detect issues when they happen. You can do this by:

Tracking response time, error rates, and resource utilisation to spot performance degradation.
Measuring how users interact with your application through session recordings or real user monitoring (RUM).
Using tracing mechanisms to follow requests across microservices to pinpoint root causes.
Implementing an observability platform like Grafana to identify unusual patterns that could lead to service disruptions.

If you have the right alerting thresholds and escalation paths, your team can jump in and turn off flags as needed.
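One way to connect alerting thresholds to flags is an automated kill switch. The sketch below keeps a sliding window of recent request outcomes and turns a flag off when the error rate crosses a threshold. The class name, the 5% threshold, and the window sizes are assumptions chosen for illustration; in practice the final step would call your flag tool and notify whoever is on call.

```python
from collections import deque


class ErrorRateKillSwitch:
    """Turns a flag off when the recent error rate crosses a threshold."""

    def __init__(self, threshold: float = 0.05, window: int = 100,
                 min_samples: int = 20):
        self.threshold = threshold
        self.min_samples = min_samples
        self.results = deque(maxlen=window)  # True = request failed
        self.flag_enabled = True

    def record(self, failed: bool) -> None:
        """Record one request outcome and trip the switch if needed."""
        self.results.append(failed)
        if len(self.results) < self.min_samples:
            return  # too little data to judge
        error_rate = sum(self.results) / len(self.results)
        if error_rate > self.threshold:
            # In practice: disable the flag in your flag tool and alert on-call.
            self.flag_enabled = False


# Simulate a feature where one in four requests fails.
switch = ErrorRateKillSwitch(threshold=0.05, window=100, min_samples=20)
for i in range(40):
    switch.record(failed=(i % 4 == 0))
print("flag enabled:", switch.flag_enabled)
```

The `min_samples` guard stops a single early failure from disabling the feature before there is enough traffic to judge it.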

3. Ensure data privacy and security during testing

Even when you’re testing in production, you can’t risk the confidentiality or security of your own (and customers’) data. Feature flags give you all the control you need but could lead to many issues if you don’t have the right security measures.

This is why engineering teams use role-based access control (RBAC) to restrict who can create, modify, and toggle the flags. You can add specific user roles and permissions for each role to ensure only the right stakeholders have access.

Other than that, maintain detailed logs of all your feature flag changes. If you’re troubleshooting a problem, you can see what changes were made and work your way from there.
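The two controls above, role-based permissions and a change log, can be sketched together. This is a simplified in-memory model with hypothetical role names and fields, not the permission model of any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical role-to-permission mapping; adapt to your own roles.
PERMISSIONS = {
    "admin": {"create", "toggle", "delete"},
    "developer": {"create", "toggle"},
    "viewer": set(),
}


@dataclass
class FlagStore:
    flags: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def toggle(self, user: str, role: str, flag: str, enabled: bool) -> None:
        """Change a flag only if the role allows it, and record the change."""
        if "toggle" not in PERMISSIONS.get(role, set()):
            raise PermissionError(f"{user} ({role}) may not toggle {flag}")
        previous = self.flags.get(flag)
        self.flags[flag] = enabled
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "flag": flag,
            "from": previous,
            "to": enabled,
        })
```

Each log entry records who changed which flag, when, and from what state, which is the information you need when tracing an incident back to a flag change.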

How to test in production using feature flags

The first thing you need to do is set up a feature flag. Here’s how you can do that in Flagsmith.

Next, think about the lifecycle of the feature flag. Remove flags once testing is complete unless they serve as kill switches or long-lived flags. As a result, you’ll avoid any technical debt that arises from these flags. And you won’t have to deal with a rogue flag that somebody turns on accidentally.

When you have a process down for testing, automate the flag controls. You can do that by:

Generating feature flags when you create new branches
Tying the flag state to specific deployments
Triggering tests based on specific criteria and rolling them out to test users
Automatically increasing feature exposure when your tests reach a performance threshold

Using this approach, you’ll turn a seemingly risky strategy into a competitive advantage.

What are the best practices for testing in production environments?

Now that you know how to test in production, let’s look at how you can make the most of it:

1. Establish clear testing objectives and success criteria

Define your testing objectives before you implement this process. Document what you’re testing and why. It’ll give you some guidance on how to approach the testing process and what to measure. For example, performance tests focus on response times, while feature validation focuses on user completion rates.

Also, make sure your success criteria (and failure conditions) are specific and measurable.

NOT: “This feature should help the user achieve XYZ goal”.

BUT: “API response times remain under 200ms at 95th percentile”.

Specific, measurable criteria remove ambiguity when deciding if it’s time to deploy the feature to the entire userbase.
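A measurable criterion like the latency example above can be checked directly against collected samples. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions; the function names and sample data are illustrative.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def meets_latency_target(samples_ms: list[float],
                         target_ms: float = 200.0) -> bool:
    """Success criterion from the example: p95 latency stays under the target."""
    return percentile(samples_ms, 95) < target_ms


# 20 made-up response times: one slow outlier passes, two do not.
print(meets_latency_target([100.0] * 19 + [900.0]))
print(meets_latency_target([100.0] * 18 + [900.0] * 2))
```

With only 20 samples, a second slow request is enough to push the 95th percentile over the target, which is why the sample size behind a pass or fail decision matters as much as the threshold itself.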

2. Implement version control and rollback mechanisms

Always version your feature flags and code to maintain alignment between flag configurations and the code they control. So, treat the flag definition as code and store it in your version control system.
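Treating flag definitions as code could be as simple as keeping them in a file in the repository and validating that file in CI. The sketch below assumes a made-up JSON layout with `enabled`, `owner`, and `kind` fields; none of these names come from a specific tool.

```python
import json

# Hypothetical flag definition file, as it might be committed next to the code.
FLAGS_JSON = """
{
  "new_checkout": {"enabled": false, "owner": "payments-team", "kind": "release"},
  "checkout_kill_switch": {"enabled": true, "owner": "payments-team", "kind": "kill_switch"}
}
"""

REQUIRED_KEYS = {"enabled", "owner", "kind"}
ALLOWED_KINDS = {"release", "kill_switch", "experiment"}


def validate_flags(raw: str) -> dict:
    """Parse flag definitions and reject incomplete or unknown entries."""
    flags = json.loads(raw)
    for name, spec in flags.items():
        missing = REQUIRED_KEYS - spec.keys()
        if missing:
            raise ValueError(f"{name}: missing {sorted(missing)}")
        if spec["kind"] not in ALLOWED_KINDS:
            raise ValueError(f"{name}: unknown kind {spec['kind']!r}")
        if not isinstance(spec["enabled"], bool):
            raise ValueError(f"{name}: 'enabled' must be true or false")
    return flags


flags = validate_flags(FLAGS_JSON)
print(sorted(flags))
```

Recording an owner and a kind for each flag also makes cleanup easier, since short-lived release flags are easy to tell apart from kill switches that are meant to stay.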

Also, use the toggles to roll back features quickly when needed. You can automate rollbacks based on monitoring alerts, or implement a “break glass” procedure for emergencies. A break-glass procedure should define:

Who has the authority to trigger emergency disablement (RBAC)
How to disable the feature (including direct database access if necessary)
Communication templates for notifying stakeholders
Post-incident analysis procedures

Together, version control and rollback mechanisms make root cause analysis faster.

3. Collaborate with cross-functional teams and stakeholders

Before and during the testing process, get alignment on who can access and make changes to the code/flags. Typically, the following teams are involved:

Developers: Responsible for implementing feature flags
Quality assurance: Responsible for validation testing
Operations: Responsible for monitoring production metrics
Product: Consulted on rollout decisions

So, build operating procedures around this and implement access controls accordingly. Make sure you have regular touchpoints like daily standups or status checks during active testing periods. Then, you don’t have to worry about missing any observations or concerns and can adjust testing plans as needed.

Deploy features with confidence by testing in production

Ultimately, no staging environment can perfectly replicate production conditions, no matter how carefully you build it.

If you want to mitigate these challenges, consider switching to testing in production—with feature flags. You’ll be able to decouple deployment from release and control the release of every feature without hesitation.

The future of software quality doesn’t lie in a simulated environment. It lies in controlled testing in an environment that truly matters—your production environment.

About the author

Flagsmith contributing writer.
