Automated Testing & Zero Downtime Deployments
Committing your code and having it appear a few minutes later in a running environment, without any hiccups or anyone noticing, is something of a holy grail of modern software engineering. The long-term destination for many teams is pushing code straight to production, multiple times a day. For many, however, the journey is more important than that destination.
We have identified 5 pillars that you need to establish to be in a position to Delete Your Staging Environment. Some of these are more process oriented, and others are more technical. Achieving regular, zero downtime deployments is a critical pillar, and leans hard on technical aspects, DevOps and infrastructure tooling. If you are running an older platform with some legacy infrastructure, this pillar can take a lot of work to implement, but the rewards will continue for the lifetime of the product.
Let’s dive in.
Deployments in 1999
When I worked for a digital agency back in 1999, production deployments literally took months. Why? Because you started your first production release by ordering server hardware! Boxes arrived, you cut your hands getting the servers racked in, and then started on the arduous process of installing all the relevant dependencies of your platform by hand. There was literally no tooling around these processes back then.
Once you had managed to get the hardware installed and the first release live, things became a little easier, but performing application upgrades was still a long, manual process that was fraught with danger:
- There was no concept of automated builds. You had to compile, package, transfer and deploy your code manually.
- There was no structure around testing, and almost none of it was automated. Testing was often a bunch of people sitting in a room clicking around on the website trying to break things.
- There was no “elastic” infrastructure, and you generally didn’t have the luxury of spare servers sat around. That meant releasing a new version was a case of stopping the web server, copying your new code onto the server and starting the web server again.
- Controlling things like load balancers often meant going into a data center with a weird cable and a laptop. Ditto routers, domain name controllers and so on.
All of this meant that releases were infrequent, painful, slow and error prone. They were often done at night, and generally everyone hated doing them. Interestingly, the problems in the list above have been solved one by one, advancing the state of the art and making life easier for engineers and product managers.
Adopting best practices in each of these areas can get you to the point where production releases are so common and frequent that your team doesn’t even know when they are happening.
Whether you are starting a new project from scratch, or have a legacy application that was started many years ago, basing your development practices on the approach below will reap rewards as time goes on.
Automate your Builds
What does nirvana look like? Every commit of your code is automatically tested, built, packaged, artefacted and deployed within a few minutes. Tests are repeatable and dependable. Notifications of failures are real time and relevant.
This is generally the “CI” part of “CI/CD”: getting code from your text editor into a state where it is ready to be deployed into your infrastructure. The key here is predictability and repeatability. Just to recap the high-level steps:
- Set up an automated pipeline that triggers every time you commit your code
- Run your unit tests, code linting, static analysis etc.
- Compile/build/package your code
- Run integration and end-to-end tests
- Artefact the package
- Deploy your package
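The steps above boil down to "run each stage in order and stop at the first failure". Here is a minimal sketch of that idea in Python. The stage names and the `run_stage` callback are hypothetical, chosen only for illustration; in practice your CI tool does this for you from its own configuration file.

```python
from typing import Callable, Sequence

# Hypothetical stage names mirroring the steps listed above.
STAGES = [
    "unit-tests-and-lint",
    "build",
    "integration-and-e2e-tests",
    "artefact",
    "deploy",
]


def run_pipeline(stages: Sequence[str], run_stage: Callable[[str], bool]) -> list:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for stage in stages:
        if not run_stage(stage):
            # Fail fast: later stages (including deploy) never run.
            break
        completed.append(stage)
    return completed
```

The important property is that a failing test stage means the deploy stage is never reached, which is what makes it safe to trigger the pipeline on every commit.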
Use a standard package and build manager
If you’re writing Java, that means Maven or Gradle. If you’re in JS land, npm or yarn. It’s worth the effort adopting or upgrading to the most widely adopted tool for your ecosystem. Yes, moving a Java project from Ant to Maven can be painful, but the standardisation is super important.
Pick a CI pipeline tool and lean on it hard
Choosing CircleCI, GitLab CI, Drone or GitHub Actions (or something else!) is less important than choosing one at all. Setting up pipelines to run on every commit of your code is very easy to achieve and delivers a bundle of value to the overall process, even if it is just running your automated tests.
Docker is the perfect core platform to run these tools on top of. Some (like GitLab Runner) allow you to target the host machine itself, but this can easily lead to unpredictable builds, because the results then depend on whatever happens to be installed on that machine. Stick with Docker as the CI runtime.
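To make the "Docker as the CI runtime" idea concrete, here is a rough sketch of running a single build step inside a throwaway container. The function names, the `/workspace` mount point and the example image are assumptions made for illustration; hosted CI tools wrap this up for you.

```python
import subprocess


def docker_run_command(image: str, command: str, workspace: str) -> list:
    """Build the argv for running one CI step inside a throwaway container."""
    return [
        "docker", "run",
        "--rm",                            # remove the container when the step exits
        "-v", f"{workspace}:/workspace",   # mount the checked-out code
        "-w", "/workspace",                # run the step from the mounted directory
        image,
        "sh", "-c", command,
    ]


def run_step(image: str, command: str, workspace: str) -> bool:
    """Run the step and report success. Requires a local Docker daemon."""
    result = subprocess.run(docker_run_command(image, command, workspace))
    return result.returncode == 0
```

For example, `run_step("node:20", "npm ci && npm test", "/path/to/checkout")` would run the tests against whatever is in the image rather than whatever is installed on the build machine, which is where the predictability comes from.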
Artefact your Builds
Again, we choose Docker images to artefact our builds. They are a perfect way to store a catalog of your releases.
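A catalog of releases is only useful if every image can be traced back to the build that produced it. One possible convention, sketched below, is to tag each image with the build number and a short commit hash. The naming scheme and the registry host are assumptions for illustration, not a description of any particular setup.

```python
def image_reference(registry: str, name: str, build_number: int, commit_sha: str) -> str:
    """Return a unique image reference for one build.

    The "build<number>-<short sha>" tag format is a hypothetical convention.
    """
    short_sha = commit_sha[:7]  # enough to identify the commit at a glance
    return f"{registry}/{name}:build{build_number}-{short_sha}"
```

Because each tag is written once and never reused, rolling back becomes a matter of deploying an older tag rather than rebuilding old code.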
Automate your Testing
What does nirvana look like? Every commit of your code is tested against a reproduction of your production environment. Unit, integration and end-to-end tests all happen automatically and reliably. Error reporting is precise and concise. Browsers and devices are simulated perfectly.
Thankfully we’ve progressed from the days of having a “test team”. Having a good, deep test suite is critically important to building confidence in the continuous deployment process. We test at multiple levels:
- Unit tests. This is both on the front end and the back end.
- Integration tests. Again, both on the front end and back end.
- End-to-end tests. These generally deliver the most value and catch the most bugs, but are also the most brittle. We use ChromeDriver to run our automated end-to-end tests.
Our testing process also integrates with our CI process. All commits across all branches run our full test suite, providing immediate feedback to developers.
We don’t obsess about “code coverage”. Quantifying testing can be a dangerous game, giving you a false sense of security. Thinking qualitatively about your testing, and especially your end-to-end tests, will deliver the most value over time.
If you find parts of your testing are brittle or often throw up false positives, it’s worth investing the time to solve these problems. Ditto the speed of your tests. If you can test your code more quickly, you can reduce your cycle time and improve your overall velocity.
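A first step towards fixing brittle tests is finding out which ones they are. One simple approach, sketched here with hypothetical helper names, is to run a suspect test many times and see whether it gives the same answer every time.

```python
from typing import Callable


def pass_rate(test: Callable[[], None], runs: int = 20) -> float:
    """Run a test function repeatedly and return the fraction of passing runs."""
    passed = 0
    for _ in range(runs):
        try:
            test()
            passed += 1
        except Exception:
            # Any exception, including a failed assertion, counts as a failure.
            pass
    return passed / runs


def is_flaky(test: Callable[[], None], runs: int = 20) -> bool:
    """A test is flaky if it neither always passes nor always fails."""
    return 0.0 < pass_rate(test, runs) < 1.0
```

A test that always fails is a real bug or a broken test. A test that only sometimes fails is the one quietly eroding trust in your pipeline, and is usually the better place to spend your time.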
Automate your deployments
What does nirvana look like? Deployments happen automatically, quickly and with zero downtime. Rolling back to previous releases is trivial and quick. Bonus points for having previous versions accessible from artefacted endpoints (e.g. build145.frontend.flagsmith.com).
Being able to reliably deploy your code with zero downtime means that, over time, you can forget about the process happening at all. This can be a tough nut to crack, and it will take time for your code and your team to gain trust in the process, but once it is set up you will never want to go back to hand-holding your builds.
Solving this problem can be extremely dependent on your infrastructure platform. Some, for example Vercel, Fly, Heroku or Google App Engine, have been designed to offer this functionality right from the start. For example, deploying front end code to Vercel generally just requires you to point it at your git repository and the rest is done for you with zero code required.
If you are working from a legacy code base and infrastructure, this task can take a lot of work to get right. If that’s the case, here are some tips for things you can do to break down the work.
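If your platform does not do this for you, one common way to get zero downtime is a blue-green switch: start the new release alongside the old one, wait until it is healthy, and only then move traffic across. The sketch below shows the control flow only. Every callback in it (`start_new`, `is_healthy`, `switch_traffic`, `stop`) is a hypothetical hook that you would implement against your own infrastructure.

```python
import time
from typing import Callable


def blue_green_deploy(
    start_new: Callable[[], str],
    is_healthy: Callable[[str], bool],
    switch_traffic: Callable[[str], None],
    stop: Callable[[str], None],
    current: str,
    attempts: int = 10,
    delay_seconds: float = 1.0,
) -> str:
    """Return the name of the release serving traffic after the deploy."""
    candidate = start_new()
    for _ in range(attempts):
        if is_healthy(candidate):
            switch_traffic(candidate)  # users move over only once it is healthy
            stop(current)              # the old release is retired after the switch
            return candidate
        time.sleep(delay_seconds)
    # The new release never became healthy: remove it and leave traffic alone.
    stop(candidate)
    return current
```

Users are always served by a release that has passed its health check, and a bad build never receives traffic at all. You may prefer to keep the old release running for a while after the switch so that rolling back is just another traffic switch.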
- If your application is not already containerised, we would recommend getting it running within Docker. This benefits the build process, and can also give you more options for the deployment story. Once you are in Docker you can be more flexible about where you want your container images to run.
- Make your application images as stateless as possible. This can hugely simplify deployments, rolling forward/back versions, blue-green deployments and all that fun stuff.
- Put state into things that are dependable and well trusted, like Postgres and Redis. Where possible, lean on things like AWS RDS or Google CloudSQL to look after the (hard) stateful stuff.
- Build meaningful health checks into your application that test things like database or API connections. Just because the web server is running doesn’t mean your application is!
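As a rough illustration of the last tip, a meaningful health check runs a small probe against each dependency and only reports healthy if all of them succeed. The function name and the response shape below are assumptions for the sake of the example; you would wire this into whatever route your web framework exposes for health checks.

```python
import json
from typing import Callable, Dict, Tuple


def health_report(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each dependency check and return (HTTP status, JSON body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises (e.g. connection refused) counts as failed.
            results[name] = False
    status = 200 if all(results.values()) else 503
    body = json.dumps({"healthy": status == 200, "checks": results})
    return status, body
```

In a real application the checks would do something cheap but genuine, such as running `SELECT 1` against Postgres or sending `PING` to Redis. Returning a 503 when any of them fails lets your load balancer or platform keep traffic away from an instance whose web server is up but whose application is not.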
Bringing it all together
That’s quite a lot to go over! Starting from scratch can be a daunting prospect, so try to take things one step at a time. Remember that you will unlock value at each point in the process.