For security hardening, it is desirable to set readOnlyRootFilesystem on containers and to set the nodev, noexec, and nosuid mount flags on any volume mounts. This makes it impossible to create new executable files in a container. However, these flags must be optional, as some applications do need to execute files in volumes.
The local-strict storage class is supposed to set nodev, noexec, nosuid flags on persistent volume mounts, but this no longer works after updating runc to 1.2.0 or greater.
This change in runc is probably the reason: opencontainers/runc@a68529c. With this change, a non-zero ClearedFlags now triggers a remount with MS_REMOUNT | MS_BIND, which clears all previously set per-mount-point flags (except the atime flags, which are preserved by the kernel). The mount options include "rw", so ClearedFlags contains MS_RDONLY.
It looks like setting per-mount-point flags in the CSI server only ever worked by accident. Even before this change, making a VolumeMount readonly triggered a remount in runc, and the noexec, ... flags were lost.
If we look at how the PVs are mounted inside the container, we can see the following inside the OCI runtime config which is passed to runc:
The source path is the path on the host where our CSI server has mounted the PV. runc then bind-mounts this inside the container mount namespace at the destination path, with the given mount options. The noexec flags should actually be set here, not by the CSI server.
There is an open Kubernetes issue for adding noexec on emptyDir mounts: kubernetes/kubernetes#48912
The discussion there mentions that mount options are only supported on PVs, not on inline pod volumes, because of security concerns with allowing users to set arbitrary mount options. But there are actually two different kinds of mount options in Linux: superblock mount options, and per-mount-point flags. The per-mount-point flags in current Linux versions are: nosuid, nodev, noexec, noatime, nodiratime, relatime, strictatime (API-only), readonly, nosymfollow. Allowing users to set these flags does not cause security issues. Because these flags can be different for each mount point, it makes more sense conceptually to set these flags on the VolumeMount, which corresponds to a mount point, instead of on the volume itself. In fact, one of these flags can already be set on the VolumeMount: the readonly flag, through the ReadOnly boolean field.
I think a good solution would be to add a MountOptions: []string field to VolumeMount in the Kubernetes API and to Mount in the CRI API. These would only allow the per-mount-point flags. This field would then also need to be implemented in CRI servers like containerd, which would forward it to the OCI runtime config. This would be a general solution for all types of volumes, not just for PVs or emptyDir. However, this is a lot of work, and we currently have other priorities. For now, we will just remove the local-strict storage class.
It turns out that the local-strict storage class did not have an effect
on readonly volumes, or on gVisor. And after updating runc to 1.2.0, it
no longer has an effect anywhere. It appears that setting noexec and
similar flags in the CSI server, using a storage class, is the wrong
approach and just happened to work by accident. Instead, this should
probably be implemented as a Kubernetes feature to set per-mount-point
flags on the VolumeMount.
This commit thus removes the local-strict storage class and the mount
options processing in the provisioner and CSI server. This will allow
updating runc.
Additionally, the StatefulSet end-to-end test is extended to also run
tests with gVisor. gVisor apparently does not support block volumes.
See: #361
Change-Id: Ic2f50aa3bc9442ca1dbb9e8742d5b8fecbfc3614
Reviewed-on: https://review.monogon.dev/c/monogon/+/3658
Tested-by: Jenkins CI
Reviewed-by: Lorenz Brun <lorenz@monogon.tech>
Looking further at where the mount options in the OCI runtime config are generated, this is done by containerd:
https://github.com/containerd/containerd/blob/9dfdb242ff9e91527c252b6e50031369b5498ebb/internal/cri/opts/spec_linux_opts.go#L208
Only a fixed set of mount options can be generated, and this set does not include noexec.
The Kubernetes CRI API has no mount options field:
https://github.com/kubernetes/cri-api/blob/f9fb3fa0944524c7f434042c76aab34b8bf4ea03/pkg/apis/runtime/v1/api.proto#L221
VolumeMount in the Kubernetes API does not have such a field either.