Make it possible to set noexec on volume mounts #361

jscissr · 2024-11-26T14:58:00Z

For security hardening, it is desirable to set readOnlyRootFilesystem on containers and to set the nodev, noexec, nosuid mount flags on any volume mounts. This makes it impossible to create new executable files in a container. However, these flags must be optional, as some applications do need to execute files in volumes.

The local-strict storage class is supposed to set nodev, noexec, nosuid flags on persistent volume mounts, but this no longer works after updating runc to 1.2.0 or greater.

This change in runc is probably the reason: opencontainers/runc@a68529c. With this change, a non-zero ClearedFlags now triggers a remount with MS_REMOUNT | MS_BIND, which clears all previously set per-mount-point flags (except the atime flags, which are preserved by the kernel). The mount options include "rw", so ClearedFlags contains MS_RDONLY. It looks like setting per-mount-point flags in the CSI server only ever worked by accident. Even before this change, making a VolumeMount readonly triggered a remount in runc, and the nodev, noexec, nosuid flags were lost.
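The effect described above can be illustrated with a small model. This is only a sketch under simplifying assumptions, not runc's actual code: `remountBind` and `bindFlags` are hypothetical helpers, the `remountOnCleared` switch stands in for the runc behavior before and after the change, and the atime flags are left out. The numeric flag values are the ones from the Linux `<linux/mount.h>` header.

```go
package main

import "fmt"

// Per-mount-point flag values from the Linux UAPI header <linux/mount.h>.
const (
	msRdonly uint32 = 1
	msNosuid uint32 = 2
	msNodev  uint32 = 4
	msNoexec uint32 = 8
)

// perMountMask covers the flags modelled here; the atime flags are left out.
const perMountMask = msRdonly | msNosuid | msNodev | msNoexec

// remountBind models a MS_REMOUNT|MS_BIND call: the per-mount-point flags of
// the mount are replaced by the requested ones, so anything that is not
// requested again is cleared.
func remountBind(current, requested uint32) uint32 {
	return (current &^ perMountMask) | (requested & perMountMask)
}

// bindFlags is a hypothetical, simplified model of the flags a bind mount
// ends up with. inherited are the flags copied from the source mount (for
// example those set by the CSI server), set and cleared are derived from the
// mount options. remountOnCleared selects the newer behavior, where a
// non-zero cleared set also triggers the remount.
func bindFlags(inherited, set, cleared uint32, remountOnCleared bool) uint32 {
	if set != 0 || (remountOnCleared && cleared != 0) {
		return remountBind(inherited, set)
	}
	return inherited
}

func main() {
	strict := msNosuid | msNodev | msNoexec

	// "rw" option: nothing to set, MS_RDONLY to clear.
	fmt.Printf("rw, old behavior: %#x\n", bindFlags(strict, 0, msRdonly, false))
	fmt.Printf("rw, new behavior: %#x\n", bindFlags(strict, 0, msRdonly, true))

	// "ro" option: MS_RDONLY is set, which triggered a remount even before.
	fmt.Printf("ro, old behavior: %#x\n", bindFlags(strict, msRdonly, 0, false))
}
```

In this model the strict flags survive only in the first case, which matches the observation that the feature worked for read-write mounts until the runc update and never worked for readonly mounts.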

If we look at how the PVs are mounted inside the container, we can see the following inside the OCI runtime config which is passed to runc:

    {
      "destination": "/vol/local-strict",
      "type": "bind",
      "source": "/data/kubernetes/kubelet/pods/ea9c191a-30e2-4031-b2a5-ee84a9d3e781/volumes/kubernetes.io~csi/pvc-b1a6cd84-70f4-4a0a-be81-01046f5eb2c0/mount",
      "options": [
        "rbind",
        "rprivate",
        "rw"
      ]
    },

The source path is the path on the host where our CSI server has mounted the PV. runc then bind-mounts this inside the container mount namespace at the destination path, with the given mount options. The nodev, noexec, nosuid flags should actually be set here, in these mount options, not by the CSI server.
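A quick way to check which flags a container will actually get is to inspect the mount entry in the OCI runtime config. The following is a minimal sketch: the field names follow the mount entry shown above, `ociMount`, `parseMount` and `hasOption` are names made up for this example, and the source path in the sample is a shortened placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ociMount mirrors the fields of the mount entry shown above.
type ociMount struct {
	Destination string   `json:"destination"`
	Type        string   `json:"type"`
	Source      string   `json:"source"`
	Options     []string `json:"options"`
}

// sampleMount is the entry from the issue, with a placeholder source path.
const sampleMount = `{
  "destination": "/vol/local-strict",
  "type": "bind",
  "source": "/placeholder/path/to/pv/mount",
  "options": ["rbind", "rprivate", "rw"]
}`

// parseMount decodes a single mount entry.
func parseMount(data string) (ociMount, error) {
	var m ociMount
	err := json.Unmarshal([]byte(data), &m)
	return m, err
}

// hasOption reports whether opt is listed in the mount's options.
func hasOption(m ociMount, opt string) bool {
	for _, o := range m.Options {
		if o == opt {
			return true
		}
	}
	return false
}

func main() {
	m, err := parseMount(sampleMount)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// None of the strict flags are requested for the bind mount.
	for _, opt := range []string{"nodev", "noexec", "nosuid"} {
		fmt.Printf("%s: %s requested: %v\n", m.Destination, opt, hasOption(m, opt))
	}
}
```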

If we look further at where the mount options in the OCI runtime config are generated, we find that this is done by containerd:
https://github.com/containerd/containerd/blob/9dfdb242ff9e91527c252b6e50031369b5498ebb/internal/cri/opts/spec_linux_opts.go#L208
Only a fixed set of mount options can be generated, which does not include noexec.

The Kubernetes CRI API has no mount options field:
https://github.com/kubernetes/cri-api/blob/f9fb3fa0944524c7f434042c76aab34b8bf4ea03/pkg/apis/runtime/v1/api.proto#L221
VolumeMount in the Kubernetes API does not have such a field either.

There is an open Kubernetes issue for adding noexec on emptyDir mounts: kubernetes/kubernetes#48912
The discussion there mentions that mount options are only supported on PVs, not on inline pod volumes, because of security concerns with allowing users to set arbitrary mount options. But there are actually two different kinds of mount options in Linux: superblock mount options, and per-mount-point flags. The per-mount-point flags in current Linux versions are: nosuid, nodev, noexec, noatime, nodiratime, relatime, strictatime (API-only), readonly, nosymfollow. Allowing users to set these flags does not cause security issues. Because these flags can be different for each mount point, it makes more sense conceptually to set these flags on the VolumeMount, which corresponds to a mount point, instead of on the volume itself. In fact, one of these flags can already be set on the VolumeMount: the readonly flag, through the ReadOnly boolean field.

I think a good solution would be to add a MountOptions: []string field to VolumeMount in the Kubernetes API and to Mount in the CRI API. These would only allow the per-mount-point flags. This field would then also need to be implemented in CRI servers like containerd, which would forward it to the OCI runtime config. This would be a general solution for all types of volumes, not just for PVs or emptyDir. However, this is a lot of work, and we currently have other priorities. For now, we will just remove the local-strict storage class.
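To make the proposal more concrete, here is a rough sketch of what validating such a field could look like. Everything in it is hypothetical: the `VolumeMount` struct is a cut-down stand-in for the Kubernetes type, the `MountOptions` field does not exist in any released API, and the option spellings (in particular "ro" for the readonly flag) are an assumption. The allow-list is the set of per-mount-point flags listed above.

```go
package main

import (
	"fmt"
	"strings"
)

// VolumeMount is a cut-down, hypothetical stand-in for the Kubernetes type,
// extended with the proposed MountOptions field.
type VolumeMount struct {
	Name         string
	MountPath    string
	ReadOnly     bool
	MountOptions []string // proposed field, not part of the real API
}

// perMountPointFlags is the allow-list of per-mount-point flags named in the
// issue. The spelling of each option string is an assumption.
var perMountPointFlags = map[string]bool{
	"nosuid":      true,
	"nodev":       true,
	"noexec":      true,
	"noatime":     true,
	"nodiratime":  true,
	"relatime":    true,
	"strictatime": true,
	"ro":          true,
	"nosymfollow": true,
}

// validateMountOptions rejects every option that is not a per-mount-point
// flag, so superblock options can never be passed through this field.
func validateMountOptions(opts []string) error {
	var rejected []string
	for _, o := range opts {
		if !perMountPointFlags[o] {
			rejected = append(rejected, o)
		}
	}
	if len(rejected) > 0 {
		return fmt.Errorf("unsupported mount options: %s", strings.Join(rejected, ", "))
	}
	return nil
}

func main() {
	vm := VolumeMount{
		Name:         "data",
		MountPath:    "/vol/data",
		MountOptions: []string{"nodev", "noexec", "nosuid"},
	}
	fmt.Println("strict mount:", validateMountOptions(vm.MountOptions))

	// A superblock option such as a tmpfs size is rejected.
	fmt.Println("with size=64m:", validateMountOptions([]string{"noexec", "size=64m"}))
}
```

Restricting the field to a fixed allow-list is what would address the security concern raised in the Kubernetes discussion, since no superblock option could reach the runtime this way.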


It turns out that the local-strict storage class did not have an effect on readonly volumes, or on gVisor. And after updating runc to 1.2.0, it no longer has an effect anywhere. It appears that setting noexec and similar flags in the CSI server, using a storage class, is the wrong approach and just happened to work by accident. Instead, this should probably be implemented as a Kubernetes feature to set per-mount-point flags on the VolumeMount.

This commit thus removes the local-strict storage class and the mount options processing in the provisioner and CSI server. This will allow updating runc.

Additionally, the StatefulSet end-to-end test is extended to also run tests with gVisor. gVisor apparently does not support block volumes.

See: #361
Change-Id: Ic2f50aa3bc9442ca1dbb9e8742d5b8fecbfc3614
Reviewed-on: https://review.monogon.dev/c/monogon/+/3658
Tested-by: Jenkins CI
Reviewed-by: Lorenz Brun <lorenz@monogon.tech>

jscissr added the enhancement (New feature or request), c/k8s (Kubernetes component), and security labels on Nov 26, 2024
