I recently worked on a project to move a complicated legacy application onto Kubernetes. It was quite an undertaking, but in the end we were successful. One of the biggest challenges was figuring out how to automate our legacy deployment process, one where the whole application has to be stopped completely for schema upgrades to run.

The normal “Kubernetes way” to upgrade an application is by changing the Deployment resource. With the default RollingUpdate strategy, Kubernetes starts a pod with the new definition, waits for it to become healthy, then removes a pod with the old definition, repeating until the change has been rolled out everywhere.
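
As a refresher, here is a minimal sketch of a Deployment using that default strategy; the application name, image, and numbers are illustrative, not our real resources:

# Minimal Deployment sketch showing the default RollingUpdate strategy.
# The name, image, and replica count are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%   # how many pods may be down during the rollout
      maxSurge: 25%         # how many extra pods may exist temporarily
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example-app:2.0.0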

This process wouldn’t work for us, and it wasn’t obvious how to automate one that would. Our application is tied to its schema version: new versions of the app won’t run on the old schema, old versions of the app can’t run on the new schema, and the schema migrator won’t start if it detects any running application instances. We would have preferred a zero-downtime rolling update, but making our application support one wasn’t possible within our timelines. I expect it will eventually be implemented, but it will require several changes and a significant testing effort.

The process we wanted to automate was:

  1. Shut down the old version of the application
  2. Run the schema migrator
  3. Start the new version of the application

Or putting it into Kubernetes terms:

  1. Delete all the pods
  2. Run the schema migrator job
  3. Create new pods (with new image tag)

We tried a few different approaches, but the solution we ultimately chose was using ArgoCD with its sync phases and waves feature. There were a few unexpected challenges, but we were still happy with the results.

What is ArgoCD and What Are Sync Phases and Waves?

ArgoCD is a powerful open source tool that deploys Helm charts to a Kubernetes cluster. The charts and their settings are pulled from a configurable source, in our case GitHub, which let us store all of our Kubernetes configuration as code. We wanted better visibility and consistency in our infrastructure, and this tool makes that possible. The “GitOps” workflow it enables is an added bonus.

ArgoCD is built around a custom resource called an Application. An Application represents a single installation of a Helm chart. Its definition includes a source (where to retrieve the chart), a destination (where to install the chart), and any parameters to apply to the chart. You can automate more complicated scenarios with charts that themselves contain Applications, resulting in Applications containing Applications. You can also use the ApplicationSet resource to generate multiple Application objects. We combine all of these to push out many similar, but not identical, applications across several environments.
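
A minimal Application sketch looks something like this; the repository URL, chart path, and parameter values are placeholders rather than our actual configuration:

# Sketch of an ArgoCD Application resource; repo URL, path, and parameter
# values are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deployment-config.git
    targetRevision: develop
    path: charts/example-app
    helm:
      parameters:
        - name: image.tag
          value: "2.0.0"
  destination:
    server: https://kubernetes.default.svc
    namespace: example-app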

ArgoCD periodically compares the source (code) and destination (cluster state). If there are any differences, the application gets marked as out of sync. ArgoCD can synchronize all the changes in an application automatically or with the press of a button. It also has a nice user interface that shows all the applications, their state, and some other useful information.

If your application can be deployed all at once as a simple Helm chart, ArgoCD handles it easily. For a more complicated deployment process like ours, we used the sync phases and waves feature: we added special annotations to a few of our resource definitions to control the order in which ArgoCD applies their changes. It looks like this:

metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "5"

How We Automated Our Deployment Process

Our process uses these steps:

  1. PreSync / -1 - Create necessary secrets, service accounts, etc.
  2. PreSync / 10 - Create job: run a script that sets the replica count to 0
  3. PreSync / 20 - Create job: run the schema migrator docker image
  4. Sync (default) - Create the deployment, services, and everything else

The PreSync / 10 step ensures the application is stopped before continuing. It checks that the deployment exists so it won’t fail on the very first run, then it sets the replica count to 0. The pods get deleted pretty quickly after this change is applied.
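
A sketch of what that scale-down job could look like is below. The deployment name, service account, and kubectl image are assumptions, and it relies on a service account with permission to scale the deployment having been created in the earlier wave:

# Hypothetical PreSync job that scales the application to zero replicas.
# Names and the kubectl image are placeholders, not our real resources.
apiVersion: batch/v1
kind: Job
metadata:
  name: scale-down-app
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "10"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      serviceAccountName: deploy-scaler
      restartPolicy: Never
      containers:
        - name: scale-down
          image: bitnami/kubectl:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Skip on the very first run, when the deployment doesn't exist yet
              if kubectl get deployment example-app >/dev/null 2>&1; then
                kubectl scale deployment example-app --replicas=0
              fi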

The schema migrator job runs next. It can upgrade the schema of an existing database or create a new one if it doesn’t already exist. Once it completes, the rest of the resources are created in the Sync step. The deployment resource sets the new application version and restores the replica count. Pods start getting created, and pretty soon we have a fully working application.
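
The migrator itself is an ordinary Kubernetes Job running in the next PreSync wave; something along these lines, where the image and secret name are placeholders:

# Sketch of the schema migrator job; the image and credentials secret
# are hypothetical.
apiVersion: batch/v1
kind: Job
metadata:
  name: schema-migrator
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "20"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 2          # number of retries Kubernetes attempts before failing the job
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrator
          image: example-registry/schema-migrator:2.0.0
          envFrom:
            - secretRef:
                name: database-credentials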

If the schema migrator job fails, Kubernetes retries it according to the job definition (its backoffLimit). If the job still can’t complete, the synchronization cycle stops and is marked as failed in ArgoCD. A human can then make any necessary changes and trigger another synchronization cycle.

We ran this process hundreds of times in our development environment and several more times in our production environments. The ArgoCD part of it always worked correctly. Since some of our applications had an installation per tenant, we also ran several synchronizations in parallel with no issues. We did have a few deployments fail, but they were all caused by infrastructure issues or application bugs, and that would have been no different on any other Kubernetes system.

Branching Strategy

The GitOps approach to managing our environments brought some significant benefits, but it also made it more challenging to test changes to our charts. For example, some application changes required adding or dropping startup parameters in the pod definition. We had to be especially careful that an environment wouldn’t break if the new chart was applied before the application version was updated. Simple changes like this could be handled with conditional blocks in the Helm templates, but things got a lot messier when we were updating community charts for our logging or monitoring infrastructure, or changing the shared ingress definitions.
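
For the simple cases, a conditional in the pod template was enough. A hypothetical example, where the flag name and the version check on the image tag are illustrative rather than from our real charts:

# Hypothetical Helm template fragment guarding a new startup parameter.
containers:
  - name: app
    image: "example-app:{{ .Values.image.tag }}"
    args:
      - "--existing-flag"
      {{- if semverCompare ">=2.0.0" .Values.image.tag }}
      - "--new-startup-parameter"   # only passed once the new app version is deployed
      {{- end }}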

To ensure we didn’t have any accidents, we moved to a branching strategy. We now use three branches:

  • develop - for our development environment. This is where most of our chart changes occur. It’s also where we test daily builds of our applications
  • staging - for our staging environment. We use this to test new charts and applications before a production release
  • production - for all of our production environments

Changes to the charts get tested in the development environment. They then get merged to the staging branch just before a major release. As part of the production release cycle we merge the same changes from the staging branch to the production branch, making sure that only tested changes get deployed.

To ensure there is no configuration drift, we also occasionally merge changes from the staging and production branches back to the development branch.

We did find branching a bit difficult to use, especially for parts of our team that had less experience with Git. Even with this difficulty, we found the added safety worthwhile.

The Good: Our Upgrade Process

Our upgrade run list was beautifully simple:

  1. Make sure the environment is healthy and all the Application resources are in a healthy state.
  2. Merge any changes from the previous branch to the target branch (e.g. develop to staging, or staging to production).
  3. For each installation, modify the version (docker image tag) in the appropriate configuration YAML files and merge those changes (see the sketch after this list).
  4. Find the applications marked as out of sync in ArgoCD and trigger synchronization cycles. Wait for them to finish.
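
Step 3 usually amounted to a one-line change per installation. The layout here is illustrative of the idea, not our exact values files:

# Hypothetical per-installation values file; only the image tag changes per release.
image:
  repository: example-registry/example-app
  tag: "2.1.0"   # bumped as part of step 3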

Once we started the synchronization cycle, ArgoCD would start applying the changes. A few minutes later the new version would start up and the web services would start responding to requests again.

The Bad: Scaling Pods Without Downtime

The biggest drawback of this approach was that we couldn’t make minor configuration tweaks to our production system through the code without incurring downtime. ArgoCD applies changes from the source through a synchronization cycle. That was great when we were changing the version of the application and the schema migrator needed to run, but not so great when we just needed to add a little memory or increase the replica count to keep everything working smoothly.

In these cases we had to make changes to the Kubernetes resources directly, bypassing ArgoCD. This meant the Application resources would be marked as out of sync until we made the same changes in the code. If we forgot this step, the changes would get stomped during the next synchronization cycle.

ArgoCD has a feature to ignore certain state differences in a resource. This is great when you’re using autoscaling or other Kubernetes automation features. We couldn’t use it, though, because it sometimes prevented ArgoCD from restoring the replica count from 0 during the Sync phase.
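
The feature in question is ignoreDifferences on the Application resource. For example, telling ArgoCD to ignore a deployment’s replica count looks roughly like this:

# ignoreDifferences on an ArgoCD Application; this is the common pattern for
# ignoring an externally managed replica count.
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas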

This is only an issue when deploying applications that need downtime while the chart is being applied. Many applications designed to run in Kubernetes will keep working throughout this process, and ArgoCD handles them just fine.

The Ugly: Automatic Synchronization

ArgoCD is capable of triggering synchronization cycles automatically when it detects changes. This is helpful if you want your cluster’s state to always match your committed code, which is the ultimate goal of a GitOps workflow. The drawback is that a synchronization cycle can then start at any moment, without anyone choosing to run it. Since our process involves downtime, we didn’t want this to happen unintentionally in our production environments.

The other problem we ran into with automatic synchronization was that it made it harder to test minor configuration changes in our development environment. If we added or removed a bit of memory to measure the impact, ArgoCD would quickly reset it. We could add parameters to make these things configurable, but that increased the complexity of the charts and made them harder to read. It also meant we had to remember to remove the same changes again later.

The setting for automatic synchronization is specified on a per-application basis via the resource definition, so you can make this behavior optional for some of your applications. We decided to use manual synchronization in our staging and production environments for everything but the top-level charts. This allowed us to control when changes were applied.
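
Automatic synchronization is turned on through the syncPolicy block of the Application resource; leaving out the automated section keeps synchronization manual, which is what we do in staging and production:

# Enabling automatic synchronization on an Application; omit syncPolicy.automated
# to require a manual sync instead.
spec:
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from the source
      selfHeal: true   # revert manual changes made directly in the cluster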

Conclusion

Using ArgoCD to automate our more complicated upgrade process worked well for us, even with its challenges. I would recommend this solution to others.

Another strong reason to use ArgoCD is that it is an excellent tool to use even if you don’t need to control the synchronization process. It was a great platform for us to deploy newer Kubernetes-native applications, and it was convenient to use the same tool for everything. It also left us in a position where we could iterate gradually to a simpler deployment process with our legacy applications.