Background: Taboola's recommendation engine responds to hundreds of thousands of requests per second. The service has to be fast: its p95 response time should stay below 500 milliseconds per request. That means we cannot afford any downtime, or even slower responses.
In addition, it's critical to prevent the installation of a faulty version. A faulty version could lead to downtime or degraded performance, which can directly result in lost revenue. For this reason, we have multiple testing gateways during development to help keep a bad version from shipping. However, in our experience, unexpected and often bad things can still happen when the software meets production, and we need to be ready for that. Another important requirement is to deploy during office hours, when most of the engineers are available to assist should something go wrong.
Goals: To deploy a highly sophisticated Java service, one that is actively developed every day, to thousands of servers in multiple data centers around the world.
Solution & Results: To meet these objectives, we designed a deployment flow. The following are the flow stages at a high level:
Jenkins pipelines made the implementation of a very complex flow easy.
The deployment procedure in a single data center goes like this:
Get the list of servers to be deployed
Calculate the size of the server batch (using metrics and math :)
For each server in the batch
Run a batch verification to check various metrics of the domain
Wait one minute before starting the next server batch
Repeat until no servers are left
For reference, the flow is detailed at: https://engineering.taboola.com/high-scale-service-deployment/
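As a rough illustration of the single data center procedure listed above, here is a minimal sketch of the batch loop. It is written in Python for readability and is not the actual Jenkins Pipeline Groovy code. The function name `deploy_data_center` and the callbacks `batch_size_for`, `deploy_server`, and `verify_batch` are hypothetical placeholders for the real logic, which is not shown here. The sketch also assumes that the batch size is recalculated on every round and that a failed verification stops the rollout.

```python
import time
from typing import Callable, List, Sequence


def deploy_data_center(
    servers: Sequence[str],
    batch_size_for: Callable[[int], int],      # hypothetical: "metrics and math"
    deploy_server: Callable[[str], None],      # hypothetical: deploys one server
    verify_batch: Callable[[List[str]], bool], # hypothetical: checks batch metrics
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Deploy to servers batch by batch and return them in deployment order."""
    remaining = list(servers)
    deployed: List[str] = []
    while remaining:
        # Assumption: the batch size is recomputed each round. It is clamped
        # to at least 1 so the loop always makes progress.
        size = max(1, min(batch_size_for(len(remaining)), len(remaining)))
        batch, remaining = remaining[:size], remaining[size:]
        for server in batch:
            deploy_server(server)
            deployed.append(server)
        # Assumption: a failed verification aborts the rollout.
        if not verify_batch(batch):
            raise RuntimeError(f"batch verification failed for {batch}")
        # Wait before the next batch, but not after the last one.
        if remaining:
            sleep(wait_seconds)
    return deployed
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays. In a pipeline, the equivalent steps would call into the shared deployment code instead of these callbacks.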
All of the logic is implemented with Jenkins Pipelines and their Groovy support. We created a large shared library repository containing our deployment flow infrastructure, which makes the process easy to maintain, extend, and generalize to other services. We also use several Jenkins plugins during the flow run to report metrics and raise alerts. For example, we integrated the PagerDuty plugin to trigger an alert in case of a failure. The alert is triggered and resolved automatically by code.
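To illustrate what triggering and resolving an alert "by code" can look like, here is a small Python sketch. It does not use the PagerDuty plugin or its API. `AlertClient`, `trigger`, `resolve`, and `run_with_alerting` are hypothetical names introduced only for this example. It assumes one plausible policy: a failing step triggers an alert under a fixed key, and a later successful run resolves the alert under the same key.

```python
from typing import Callable, Protocol, TypeVar

T = TypeVar("T")


class AlertClient(Protocol):
    """Hypothetical alerting interface; not the PagerDuty plugin's API."""

    def trigger(self, key: str, summary: str) -> None: ...

    def resolve(self, key: str) -> None: ...


def run_with_alerting(key: str, step: Callable[[], T], alerts: AlertClient) -> T:
    """Run a deployment step, alerting on failure and resolving on success."""
    try:
        result = step()
    except Exception as exc:
        # Failure: open an alert under this key, then re-raise so the
        # surrounding flow still sees the error.
        alerts.trigger(key, f"{key} failed: {exc}")
        raise
    # Success: close any alert that is open under the same key.
    alerts.resolve(key)
    return result
```

Using a stable key per deployment target means that the same code path that opened the alert can close it later, with no manual step.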
All in all, we saw great results, including: