Deploying on Kubernetes and OpenShift with OpenJ9 CRIU Support

The previous blog post showed how to restore an application that checkpointed itself using OpenJ9 CRIU Support in an unprivileged container. This blog post goes over how to deploy these containers in Kubernetes (K8s) and OpenShift Container Platform (OCP).

Prerequisites

  • A kernel that supports the CAP_CHECKPOINT_RESTORE Linux capability. This capability was introduced in kernel version 5.9 and has also been backported to the kernel shipped with RHEL 8.6.
  • CRI-O configured to use crun or runc.
  • If using runc, the version needs to be 1.1.3 or higher to have the recent fix which enables mounting /proc/sys/kernel/ns_last_pid.
  • A container image with the checkpointed application.
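The kernel and runc requirements above can be checked on a node with a few lines of shell. This is a minimal sketch, not an official check: the helper name version_ge is hypothetical, it relies on GNU sort -V, and it assumes that runc --version prints a first line of the form "runc version X.Y.Z". A kernel older than 5.9 may still carry a vendor backport (as with RHEL 8.6), so a negative result here is only a hint.

```shell
# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip any distribution suffix (for example "-372.el8.x86_64") before comparing.
kernel_version="$(uname -r | cut -d- -f1)"
if version_ge "$kernel_version" "5.9"; then
  echo "kernel $kernel_version: CAP_CHECKPOINT_RESTORE should be available"
else
  echo "kernel $kernel_version: older than 5.9, check for a vendor backport"
fi

# Only relevant when CRI-O is configured to use runc rather than crun.
if command -v runc >/dev/null 2>&1; then
  runc_version="$(runc --version | awk '/^runc version/ {print $3}')"
  if version_ge "$runc_version" "1.1.3"; then
    echo "runc $runc_version: new enough to mount /proc/sys/kernel/ns_last_pid"
  else
    echo "runc $runc_version: older than 1.1.3"
  fi
fi
```

The script only reports what it finds and changes nothing on the node.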

Kubernetes

Clone the InstantOnStartupGuide repo.

git clone https://github.com/ibmruntimes/InstantOnStartupGuide.git
cd InstantOnStartupGuide

Use an alias for brevity.

alias kubectl-criu='kubectl --namespace=criu'

In line with K8s best practices, create the criu namespace so that the deployment can be appropriately scoped.

kubectl create -f YAMLs/k8s/criunamespace.yaml

In line with K8s best practices, create a service account in the criu namespace that will be used to run the deployment.

kubectl-criu apply -f YAMLs/k8s/criusvcacct.yaml

Launch the deployment. Note that the image field in my-app-criu.yaml should first be updated with the name of the container image that contains the checkpointed application.

kubectl-criu apply -f YAMLs/common/my-app-criu.yaml

Inspect the logs to view the application output.

kubectl-criu get pods
kubectl-criu logs 

If you chose to try out the image built in the previous blog post, you can view the output by running

kubectl-criu exec  -- tail -f out
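The logs and exec commands need the name of the pod reported by get pods. As a convenience, the name can be looked up by label. The following is a sketch only: the function name first_criu_pod is hypothetical, and it assumes the pod template in my-app-criu.yaml carries the label name: my-app-criu, as it does in the privileged template shown later in this post.

```shell
# Print the name of the first pod in the criu namespace that carries the
# label name=my-app-criu (assumed to be set by the deployment's pod template).
first_criu_pod() {
  kubectl --namespace=criu get pods \
    --selector name=my-app-criu \
    --output jsonpath='{.items[0].metadata.name}'
}

# Usage, once the deployment is running:
#   kubectl --namespace=criu logs "$(first_criu_pod)"
#   kubectl --namespace=criu exec "$(first_criu_pod)" -- tail -f out
```

The function spells out kubectl --namespace=criu instead of using the kubectl-criu alias because aliases are not expanded in non-interactive shell scripts by default.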

If you inspect the my-app-criu.yaml deployment file, you will see similarities to the podman run command in the previous blog post.

      containers:
        - name: my-app-criu
          image: 
          imagePullPolicy: Always
          volumeMounts:
          - mountPath: /proc/sys/kernel/ns_last_pid
            name: ns-last-pid-mount
          securityContext:
            capabilities:
              add: [ "CHECKPOINT_RESTORE", "NET_ADMIN", "SYS_PTRACE" ]
      volumes:
      - name: ns-last-pid-mount
        hostPath:
          path: /proc/sys/kernel/ns_last_pid
          type: File

As with podman, the Linux capabilities still need to be specified in the Security Context, and /proc/sys/kernel/ns_last_pid needs to be specified as a Volume Mount if using runc or a kernel that does not have the clone3 system call. However, there is no need to specify the seccomp profile because, by default, it is unconfined.
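To confirm that the capabilities from the Security Context actually reached the container, the effective capability mask can be decoded. The sketch below is illustrative: the helper name has_cap is hypothetical and the mask is a hand-built example containing exactly the three capabilities from the deployment. In a running container, the real mask is the hexadecimal value on the CapEff line of /proc/1/status. The bit numbers (12, 19 and 40) come from the kernel's linux/capability.h header.

```shell
# has_cap MASK BIT: succeeds when bit BIT is set in the hexadecimal mask MASK.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Example mask with exactly these bits set:
# CAP_NET_ADMIN (12), CAP_SYS_PTRACE (19), CAP_CHECKPOINT_RESTORE (40).
mask="0000010000081000"

for entry in NET_ADMIN:12 SYS_PTRACE:19 CHECKPOINT_RESTORE:40; do
  name="${entry%%:*}"   # text before the colon
  bit="${entry##*:}"    # text after the colon
  if has_cap "$mask" "$bit"; then
    echo "CAP_$name present"
  else
    echo "CAP_$name missing"
  fi
done
```

A real mask will normally have additional bits set, because the container runtime grants a default set of capabilities on top of the ones added in the deployment.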

OpenShift Container Platform

Clone the InstantOnStartupGuide repo.

git clone https://github.com/ibmruntimes/InstantOnStartupGuide.git
cd InstantOnStartupGuide

In line with OCP best practices, create a new project so that the deployment can be appropriately scoped.

oc new-project criu
oc project criu

In line with OCP best practices, create a new service account to run the deployment.

oc create sa criusvcacct

Create the appropriate Security Context Constraint (SCC) to allow a restore to occur with minimal privileges. This SCC is based on the restricted SCC. Additionally, create a new Role that uses this SCC.

oc apply -f YAMLs/ocp/scc-cap-cr.yaml
oc apply -f YAMLs/ocp/role-custom-scc-cap-cr-my-app-criu.yaml

Create a new Role Binding to bind the Role to the Service Account.

oc apply -f YAMLs/ocp/rolebinding-criusvcacct-my-app-criu.yaml

Launch the deployment. Note that the image field in my-app-criu.yaml should first be updated with the name of the container image that contains the checkpointed application.

oc apply -f YAMLs/common/my-app-criu.yaml

Inspect the logs to view the application output.

oc get pods
oc logs 

If you chose to try out the image built in the previous blog post, you can view the output by running

oc exec  -- tail -f out

At the time of writing, Red Hat CoreOS (4.11) does not have the necessary version of runc to allow mounting /proc/sys/kernel/ns_last_pid. There is a way to work around this, but it is not recommended in anything other than a sandbox environment. For each worker node:

  • Download the latest runc binary.
  • Create a bind mount over the existing binary by running sudo mount --bind /Path/to/latest/runc /usr/bin/runc

This gets around the fact that /usr/bin/runc can’t be updated because it is mounted as read-only.

Privileged

Running as privileged means deploying the container in the most permissive mode possible. This is generally not recommended as it significantly weakens the security of a container. However, for completeness, the following briefly outlines deploying in privileged mode.

K8s

Deploying a privileged container in K8s is relatively straightforward. The template in the deployment file YAMLs/common/my-app-criu.yaml should be updated to

  template:
    metadata:
      labels:
        name: my-app-criu
    spec:
      serviceAccount: criusvcacct
      serviceAccountName: criusvcacct
      containers:
        - name: my-app-criu
          image: 
          imagePullPolicy: Always
          securityContext:
            privileged: true

The main differences are the removal of the Volume Mount and the use of privileged: true in the securityContext field in place of the individual capabilities.

OCP

Deploying a privileged container in OCP is a little more involved. First, the resourceNames field in the YAMLs/ocp/role-custom-scc-cap-cr-my-app-criu.yaml configuration should be updated to

  resourceNames:
  - privileged

Next, YAMLs/common/my-app-criu.yaml should be updated with the changes described in the K8s section above.
