In the previous part, I showed how to set up a simple swarm cluster with 1 master and 2 worker nodes. In this part, I’m gonna continue with how to configure it to work with Gitlab and also achieve zero downtime deployment.

Gitlab

Gitlab, in a nutshell, is a complete devops platform, from project planning and source code management to CI/CD and monitoring. You can either host your own Gitlab or use gitlab.com. In this blog post, I’m gonna use gitlab.com, but everything here should still be applicable to a self-hosted Gitlab.

Before we go into details, I need to go through a few concepts of Gitlab CI/CD. In Gitlab, each project can configure a list of runners for that project (if you have a group, you can configure them at the group level). Runners are what execute your build pipelines. You can have 1 or several runners (of the same or different types) per server (I usually have 1 per server and adjust the concurrency level accordingly). By default, each project gets a shared pool of runners, but you are free to set up your own runners and assign them to your projects. To keep it simple, I’m gonna use the shared runners. Each shared runner executes the build pipeline inside a docker container.
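
If you do decide to run your own runner instead of the shared ones, registration is a single command on the machine that will run the builds. This is only a sketch; the token, description and image are placeholders you would replace with your own values:

gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.com/" \
  --registration-token "YOUR_PROJECT_REGISTRATION_TOKEN" \
  --executor "docker" \
  --docker-image "docker:19.03.0" \
  --description "my-swarm-runner"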

Gitlab also comes with a free docker registry, which can come in handy when you need to host your own (private) images. The build process usually produces docker images that can then be deployed to your production server. There are many other features, but they are outside the scope of this blog post, so I won’t cover them here.

Similar to other CI/CD services, you need a config file with instructions to let the service know how to build your project; in Gitlab it’s .gitlab-ci.yml. Let’s get started with the Gitlab config file:

image: docker:19.03.0

variables:
  DOCKER_DRIVER: overlay2
  DOCKER_TLS_CERTDIR: "" # disable automatic TLS between the job and the dind service, we only need TLS for the remote swarm

services:
  - docker:19.03.0-dind

stages:
  - Deploy image

docker deploy:
  stage: Deploy image
  only:
    - master
  script:
    - export DOCKER_TLS_VERIFY="1"
    - export DOCKER_HOST="$DOCKER_SWARM_MASTER"
    - export DOCKER_CERT_PATH="certs"
    - mkdir $DOCKER_CERT_PATH
    - echo "$DOCKER_MACHINE_CA" > $DOCKER_CERT_PATH/ca.pem
    - echo "$DOCKER_MACHINE_CLIENT_CERT" > $DOCKER_CERT_PATH/cert.pem
    - echo "$DOCKER_MACHINE_CLIENT_KEY" > $DOCKER_CERT_PATH/key.pem
    - docker stack deploy -c production.yml whoami
    - rm -rf $DOCKER_CERT_PATH

In this simple project, I only have 1 stage, which is to deploy. In real projects, there will be a lot more, for example building the image, running multiple test suites, etc. Since I’m going to deploy to my own swarm cluster, I need access to docker (docker-in-docker); that’s why I set the image to docker:19.03.0 and include docker:19.03.0-dind as a service. There are several ways to build docker images with Gitlab; this is the recommended way, so I will go with it.
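
For reference, if the pipeline also had to build and push an image to the Gitlab registry (this project doesn’t, it deploys a public image), a build job might look roughly like the sketch below. It uses Gitlab’s predefined CI_REGISTRY* variables and assumes a Build image stage is added to the stages list:

build image:
  stage: Build image
  only:
    - master
  script:
    # log in to the project's registry with the credentials Gitlab injects into every job
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:latest" .
    - docker push "$CI_REGISTRY_IMAGE:latest"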

In order to connect to my swarm cluster, I need to tell the runner how and where to connect via a few environment variables:

  • DOCKER_TLS_VERIFY: enables TLS verification, it’s off by default
  • DOCKER_HOST: the URL of the docker server, usually tcp://ip-address:port
  • DOCKER_CERT_PATH: where to look for the certificates required to connect to the remote docker server. It defaults to ~/.docker, but I don’t like to use the default path, hence the custom certs path
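
If you want to sanity-check these variables from your own machine before wiring them into the pipeline, something like this should work (the address is a placeholder for your swarm master, and 2376 is the default TLS port that docker-machine sets up):

export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://203.0.113.10:2376" # placeholder, use your master node's address
export DOCKER_CERT_PATH="$HOME/.docker/machine/certs" # wherever your client certs live, see below
docker info # should describe the remote swarm, not your local daemon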

Then we need to generate proper keys to authenticate ourselves. The easiest way is to just copy the keys that docker-machine generated when you provisioned the server with it, or follow this guide to generate new ones. To keep it simple, I will copy my docker-machine keys. They are located in ~/.docker/machine/certs
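
For example, to grab the values to paste into the DOCKER_MACHINE_CA, DOCKER_MACHINE_CLIENT_CERT and DOCKER_MACHINE_CLIENT_KEY variables used above, something like this is enough:

cat ~/.docker/machine/certs/ca.pem   # -> DOCKER_MACHINE_CA
cat ~/.docker/machine/certs/cert.pem # -> DOCKER_MACHINE_CLIENT_CERT
cat ~/.docker/machine/certs/key.pem  # -> DOCKER_MACHINE_CLIENT_KEY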

You probably don’t want to store all the secrets/keys in your repository. Gitlab has a feature that lets you set environment variables to be injected into your build pipelines. It’s still not the most secure thing, but it’s a lot better than storing all the sensitive stuff in your repository.

And then I can simply run docker stack deploy -c production.yml whoami to deploy a new version of the service to my swarm cluster. At the end of the deploy step, I always remove the certs from the build machine, just in case. Here is my production.yml

version: "3.7"

services:
  hello:
    image: jwilder/whoami:latest
    networks:
      - flix
    deploy:
      replicas: 2
      labels:
        - "traefik.enable=true"
        - "traefik.http.services.whoami.loadbalancer.server.port=8000"
        - "traefik.http.routers.whoami.rule=Host(`whoami.tannguyen.org`)"
        - "traefik.http.routers.whoami.entrypoints=web-secured"
        - "traefik.http.routers.whoami.tls.certresolver=mytlschallenge"

networks:
  flix:
    external: true

The only new thing here is the tls option; it’s there because I have Let’s Encrypt enabled for my domain, so everything must be https. flix is my internal ingress network. And here is an example of my build.

And here is the live version of whoami

Zero downtime deployment

Now that we have a simple pipeline, let’s see if there is any zero downtime deployment built in already. In order to test this, I will simply repeat a curl command against whoami while it’s being deployed. I won’t use anything fancy, just a simple bash command:

bash -c 'while [ 0 ]; do curl -s -o /dev/null -w "%{http_code} : " https://whoami.tannguyen.org; date +%H:%M:%S; sleep 1; done'

And here is what happens during the deployment. My requests were interrupted twice, which means we lost 2 of those requests. That would be a disaster if you had thousands of requests per second and, each time you deployed something, half (or even more) of them were dropped.

So, what’s wrong? There are several things:

  1. We don’t have a health check mechanism to let swarm know that our container is alive and healthy
  2. We need to tell the load balancers (there are 2 of them!) how to handle the deployment process. As I mentioned, there are 2 load balancers: one is traefik and the other is the internal ingress load balancer that swarm provides. We can’t use both for this, so we need to pick one. I’m gonna pick swarm, because that way I can freely switch to another reverse proxy in the future without changing my setup

Switch to swarm load balancer

This is what traefik shows when we use its load balancer

There are 2 servers serving whoami.tannguyen.org. We need to add a new label, "traefik.docker.lbswarm=true", to let traefik know that it should use the swarm load balancer instead. After redeploying, here is what it shows

There is only 1 server now, and that server is the swarm cluster.
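
In the compose file, that label simply goes next to the other traefik labels in the deploy block (the full file is further below):

      labels:
        - "traefik.enable=true"
        - "traefik.docker.lbswarm=true" # let the swarm load balancer do the balancing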

Add some health check

Normally I would define the health check in the Dockerfile when I build my own services, but here we use a 3rd party image, so I have to do it in the compose file. Add this to the hello service in the deployment file; it checks that localhost:8000 is reachable via wget. You can use anything here; I often use alpine as the base image, and wget is included in it.

healthcheck:
  test: "wget --quiet --tries=1 --spider http://localhost:8000 || exit 1"
  interval: "60s"
  timeout: "3s"
  start_period: "5s"
  retries: 3
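
For comparison, when you do own the image, the same check can live in the Dockerfile as a HEALTHCHECK instruction, roughly like this:

HEALTHCHECK --interval=60s --timeout=3s --start-period=5s --retries=3 \
  CMD wget --quiet --tries=1 --spider http://localhost:8000 || exit 1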

The main purpose of the healthcheck is to let docker know whether your container is up and ready to serve. For example, this is what it looks like when you do docker ps

Tell swarm how we want it to do rolling updates

Let’s start with the actual configuration (this needs to be added inside the deploy block)

update_config:
  parallelism: 1
  order: start-first
  failure_action: rollback
  delay: 10s
rollback_config:
  parallelism: 0
  order: stop-first

The update_config tells swarm that we want to update only 1 container at a time, and in start-first order, which means: start the new container, check that it’s healthy, then shut down the old one (the default is stop-first, which is the reverse). And delay is the time to wait between updating each container.

Then rollback_config tells swarm to roll back all containers at once (parallelism: 0), and in stop-first order.

This is a very simple and naive configuration that gets the job done; the right values depend on how many containers you have and how fast you want to roll out the new version.
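
If you want to watch a rolling update (or a rollback) in action, you can list the service’s tasks from a manager node. The service name comes from the stack name, so with the whoami stack above it is whoami_hello:

docker service ps whoami_hello # shows old and new tasks, their desired and current state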

And here is the full configuration file

version: "3.7"

services:
  hello:
    image: jwilder/whoami:latest
    networks:
      - flix
    healthcheck:
      test: "wget --quiet --tries=1 --spider http://localhost:8000 || exit 1"
      interval: "60s"
      timeout: "3s"
      start_period: "5s"
      retries: 3
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        order: start-first
        failure_action: rollback
        delay: 10s
      rollback_config:
        parallelism: 0
        order: stop-first
      labels:
        - "traefik.enable=true"
        - "traefik.docker.lbswarm=true"
        - "traefik.http.services.whoami.loadbalancer.server.port=8000"
        - "traefik.http.routers.whoami.rule=Host(`whoami.tannguyen.org`)"
        - "traefik.http.routers.whoami.entrypoints=web-secured"
        - "traefik.http.routers.whoami.tls.certresolver=mytlschallenge"

networks:
  flix:
    external: true

If you do docker ps you can see that the whoami container is marked as healthy.

Now there won’t be any interruption when you deploy a new version; everything should be all 200s. However, the old container still needs to handle shutdown properly and gracefully, so it doesn’t drop any half-done job or in-flight request. But that’s an implementation detail and depends a lot on what your containers do.

And that’s it! With this, we have a simple swarm cluster with zero downtime deployment!

I find swarm very easy to get started with; it took me about 2-3 hours to get my first cluster running (from zero knowledge about swarm, I didn’t even know it existed). Although swarm is considered inferior to kubernetes, I think they serve different purposes. If I just need to manage 100 containers, I will just set up a simple swarm cluster to do that. For a bigger scale, I will probably go with kubernetes. But everyone seems to pick kubernetes and talk about it, so I just wanted to try something else. I’ve never been a big fan of the “there is only one solution” mindset; I’m more on the side of “use the right tool for the right job”.