Make it possible to set noexec on volume mounts #361

Open
jscissr opened this issue Nov 26, 2024 · 0 comments
Labels
c/k8s (Kubernetes component), enhancement (New feature or request), security

Comments

Contributor

jscissr commented Nov 26, 2024

For security hardening, it is desirable to set readOnlyRootFilesystem on containers and to set the nodev, noexec, and nosuid mount flags on all volume mounts. This makes it impossible to create new executable files in a container. However, these flags must be optional, as some applications do need to execute files in volumes.

The local-strict storage class is supposed to set nodev, noexec, nosuid flags on persistent volume mounts, but this no longer works after updating runc to 1.2.0 or greater.

This change in runc is probably the reason: opencontainers/runc@a68529c. With this change, a non-zero ClearedFlags value now triggers a remount with MS_REMOUNT | MS_BIND, which clears all previously set per-mount-point flags (except the atime flags, which are preserved by the kernel). The mount options include "rw", so ClearedFlags contains MS_RDONLY. It looks like setting per-mount-point flags in the CSI server only ever worked by accident. Even before this change, making a VolumeMount readonly triggered a remount in runc, and the noexec and related flags were lost.
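The effect described above can be modeled with a small Go sketch. This is a simplified model of the behavior as described in this issue, not runc's actual code: the names `parseOptions` and `effectiveFlags` are made up for illustration, and the rule "any set or cleared flag triggers a remount that replaces the per-mount-point flags" is an assumption taken from the description above. The flag values are the Linux `MS_*` constants.

```go
package main

import "fmt"

// Per-mount-point flag values, matching the Linux MS_* constants.
const (
	msRdonly uint32 = 1
	msNosuid uint32 = 2
	msNodev  uint32 = 4
	msNoexec uint32 = 8
)

// flagOp says whether a mount option sets or clears a flag.
type flagOp struct {
	clear bool
	flag  uint32
}

// optionTable is a hypothetical, reduced option table for this sketch.
var optionTable = map[string]flagOp{
	"ro":     {false, msRdonly},
	"rw":     {true, msRdonly},
	"nosuid": {false, msNosuid},
	"suid":   {true, msNosuid},
	"nodev":  {false, msNodev},
	"dev":    {true, msNodev},
	"noexec": {false, msNoexec},
	"exec":   {true, msNoexec},
}

// parseOptions splits mount options into flags to set and flags to clear.
// Options outside the table (rbind, rprivate, ...) are ignored here.
func parseOptions(opts []string) (set, cleared uint32) {
	for _, o := range opts {
		op, ok := optionTable[o]
		if !ok {
			continue
		}
		if op.clear {
			cleared |= op.flag
			set &^= op.flag
		} else {
			set |= op.flag
			cleared &^= op.flag
		}
	}
	return set, cleared
}

// effectiveFlags models the flags of the bind mount inside the container.
// Assumption: without a remount the bind mount keeps the source's flags;
// with a remount only the explicitly requested flags survive.
func effectiveFlags(source uint32, opts []string) uint32 {
	set, cleared := parseOptions(opts)
	if set == 0 && cleared == 0 {
		return source
	}
	return set
}

// describe renders a flag set as a list of option names.
func describe(flags uint32) string {
	names := []string{}
	for _, n := range []string{"ro", "nosuid", "nodev", "noexec"} {
		if flags&optionTable[n].flag != 0 {
			names = append(names, n)
		}
	}
	return fmt.Sprintf("%v", names)
}

func main() {
	source := msNosuid | msNodev | msNoexec // flags set by the CSI server
	fmt.Println(describe(effectiveFlags(source, []string{"rbind", "rprivate"})))
	fmt.Println(describe(effectiveFlags(source, []string{"rbind", "rprivate", "rw"})))
	fmt.Println(describe(effectiveFlags(source, []string{"rbind", "rprivate", "ro"})))
}
```

In this model, the "rw" option alone is enough to drop the flags set on the source mount, and "ro" drops them as well, which matches the observation that readonly volumes were already affected before the runc update.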

If we look at how the PVs are mounted inside the container, we can see the following inside the OCI runtime config which is passed to runc:

    {
      "destination": "/vol/local-strict",
      "type": "bind",
      "source": "/data/kubernetes/kubelet/pods/ea9c191a-30e2-4031-b2a5-ee84a9d3e781/volumes/kubernetes.io~csi/pvc-b1a6cd84-70f4-4a0a-be81-01046f5eb2c0/mount",
      "options": [
        "rbind",
        "rprivate",
        "rw"
      ]
    },

The source path is the path on the host where our CSI server has mounted the PV. runc then bind-mounts this inside the container mount namespace at the destination path, with the given mount options. The noexec and similar flags should actually be set here, in the mount options, not by the CSI server.
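As a rough illustration of checking such a mount entry, the following Go sketch parses one entry from the OCI runtime config and reports which of the hardening flags are absent from its options. The type `ociMount` and the functions `parseMount` and `missingHardening` are hypothetical helpers written for this example, and the source path is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ociMount mirrors the fields of a mount entry shown in the config above.
type ociMount struct {
	Destination string   `json:"destination"`
	Type        string   `json:"type"`
	Source      string   `json:"source"`
	Options     []string `json:"options"`
}

// parseMount decodes a single mount entry from JSON.
func parseMount(data []byte) (ociMount, error) {
	var m ociMount
	if err := json.Unmarshal(data, &m); err != nil {
		return ociMount{}, fmt.Errorf("decoding mount entry: %w", err)
	}
	return m, nil
}

// missingHardening returns the hardening options not present on the mount.
func missingHardening(m ociMount) []string {
	present := map[string]bool{}
	for _, o := range m.Options {
		present[o] = true
	}
	missing := []string{}
	for _, want := range []string{"nodev", "noexec", "nosuid"} {
		if !present[want] {
			missing = append(missing, want)
		}
	}
	return missing
}

func main() {
	// Placeholder source path; the options match the entry shown above.
	entry := []byte(`{
		"destination": "/vol/local-strict",
		"type": "bind",
		"source": "/path/to/csi/volume/mount",
		"options": ["rbind", "rprivate", "rw"]
	}`)
	m, err := parseMount(entry)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(m.Destination, "is missing:", missingHardening(m))
}
```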

Looking further at where the mount options in the OCI runtime config are generated, we find that this is done by containerd:
https://github.com/containerd/containerd/blob/9dfdb242ff9e91527c252b6e50031369b5498ebb/internal/cri/opts/spec_linux_opts.go#L208
Only a fixed set of mount options can be generated, which does not include noexec.

The Kubernetes CRI API has no mount options field:
https://github.com/kubernetes/cri-api/blob/f9fb3fa0944524c7f434042c76aab34b8bf4ea03/pkg/apis/runtime/v1/api.proto#L221
VolumeMount in the Kubernetes API does not have such a field either.

There is an open Kubernetes issue for adding noexec on emptyDir mounts: kubernetes/kubernetes#48912
The discussion there mentions that mount options are only supported on PVs, not on inline pod volumes, because of security concerns with allowing users to set arbitrary mount options. But there are actually two different kinds of mount options in Linux: superblock mount options, and per-mount-point flags. The per-mount-point flags in current Linux versions are: nosuid, nodev, noexec, noatime, nodiratime, relatime, strictatime (API-only), readonly, nosymfollow. Allowing users to set these flags does not cause security issues. Because these flags can be different for each mount point, it makes more sense conceptually to set these flags on the VolumeMount, which corresponds to a mount point, instead of on the volume itself. In fact, one of these flags can already be set on the VolumeMount: the readonly flag, through the ReadOnly boolean field.
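The distinction between per-mount-point flags and superblock options could be enforced with a simple allowlist. The Go sketch below is only an illustration: `validateMountOptions` is a hypothetical function, and spelling the readonly flag as the option string "ro" is an assumption of this example. The allowlist itself is the list of per-mount-point flags given above.

```go
package main

import (
	"fmt"
	"strings"
)

// perMountPointFlags is the set of per-mount-point flags listed above,
// written as mount option strings ("ro" stands for the readonly flag).
var perMountPointFlags = map[string]bool{
	"nosuid":      true,
	"nodev":       true,
	"noexec":      true,
	"noatime":     true,
	"nodiratime":  true,
	"relatime":    true,
	"strictatime": true,
	"ro":          true,
	"nosymfollow": true,
}

// validateMountOptions rejects every option that is not a per-mount-point
// flag, which excludes filesystem-specific superblock options.
func validateMountOptions(opts []string) error {
	var rejected []string
	for _, o := range opts {
		if !perMountPointFlags[o] {
			rejected = append(rejected, o)
		}
	}
	if len(rejected) > 0 {
		return fmt.Errorf("not a per-mount-point flag: %s", strings.Join(rejected, ", "))
	}
	return nil
}

func main() {
	fmt.Println(validateMountOptions([]string{"nodev", "noexec", "nosuid"}))
	fmt.Println(validateMountOptions([]string{"noexec", "size=64m"}))
}
```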

I think a good solution would be to add a MountOptions: []string field to VolumeMount in the Kubernetes API and to Mount in the CRI API. These would only allow the per-mount-point flags. This field would then also need to be implemented in CRI servers like containerd, which would forward it to the OCI runtime config. This would be a general solution for all types of volumes, not just for PVs or emptyDir. However, this is a lot of work, and we currently have other priorities. For now, we will just remove the local-strict storage class.
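A minimal Go sketch of how a CRI server might forward such a field into the OCI runtime config follows. Everything here is hypothetical: the `volumeMount` type is a cut-down stand-in for the Kubernetes type, `MountOptions` is the proposed field and does not exist today, and `ociBindOptions` is an invented helper. The base options "rbind", "rprivate", and "ro"/"rw" are taken from the config shown earlier; other propagation modes are left out.

```go
package main

import "fmt"

// volumeMount is a reduced, hypothetical stand-in for the Kubernetes
// VolumeMount type, extended with the proposed MountOptions field.
type volumeMount struct {
	Name         string
	MountPath    string
	ReadOnly     bool
	MountOptions []string // proposed field, not part of the real API
}

// ociBindOptions builds the options list for the OCI mount entry.
// Design choice for this sketch: "ro" and "rw" are rejected in
// MountOptions so they cannot contradict the existing ReadOnly field.
func ociBindOptions(vm volumeMount) ([]string, error) {
	opts := []string{"rbind", "rprivate"}
	if vm.ReadOnly {
		opts = append(opts, "ro")
	} else {
		opts = append(opts, "rw")
	}
	for _, o := range vm.MountOptions {
		if o == "ro" || o == "rw" {
			return nil, fmt.Errorf("mount %q: use ReadOnly instead of option %q", vm.Name, o)
		}
		opts = append(opts, o)
	}
	return opts, nil
}

func main() {
	vm := volumeMount{
		Name:         "data",
		MountPath:    "/vol/data",
		MountOptions: []string{"nodev", "noexec", "nosuid"},
	}
	opts, err := ociBindOptions(vm)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(opts)
}
```

Because the flags end up in the options of the bind mount itself, they would be applied by the OCI runtime at the mount point and would not depend on flags set on the host-side source mount.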

jscissr added the enhancement, c/k8s, and security labels Nov 26, 2024
monogon-bot pushed a commit that referenced this issue Nov 27, 2024
It turns out that the local-strict storage class did not have an effect
on readonly volumes, or on gVisor. And after updating runc to 1.2.0, it
no longer has an effect anywhere. It appears that setting noexec and
similar flags in the CSI server, using a storage class, is the wrong
approach and just happened to work by accident. Instead, this should
probably be implemented as a Kubernetes feature to set per-mount-point
flags on the VolumeMount.

This commit thus removes the local-strict storage class and the mount
options processing in the provisioner and CSI server. This will allow
updating runc.

Additionally, the StatefulSet end-to-end test is extended to also run
tests with gVisor. gVisor apparently does not support block volumes.

See: #361
Change-Id: Ic2f50aa3bc9442ca1dbb9e8742d5b8fecbfc3614
Reviewed-on: https://review.monogon.dev/c/monogon/+/3658
Tested-by: Jenkins CI
Reviewed-by: Lorenz Brun <lorenz@monogon.tech>