On Thu, Oct 13, 2022 at 04:18:24AM +0200, Stefano Brivio wrote:Well, this drop_caps() is pretty much the same as patch 8/10, so it actually did something. :)Yes, but not what we wanted :).On Tue, 11 Oct 2022 16:40:15 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ah, yes, done.The current implementation of drop_caps() doesn't really work because it attempts to drop capabilities from the bounding set. hat's not the set that really matters: the bounding set is about limiting the abilities of otherwise things we might later exec() rather than our own capabilities. In addition altering the bounding set requires CAP_SETPCAP which we won't usually have. Replace it with a new version which uses setcap(2) to drop capabilities from the effective and permitted sets, which is what actually matters for most purposes. For now we leave the inheritable set alone, since we don't want to preclude the user from passing inheritable capabilities to the command spawed by pasta. Correctly dropping caps reveals that we actually need CAP_SYS_ADMIN within the userns we create/join in pasta mode, so that we can later setns() to the netns within it. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- isolation.c | 52 ++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 44 insertions(+), 8 deletions(-) diff --git a/isolation.c b/isolation.c index 4aa75e6..2468f84 100644 --- a/isolation.c +++ b/isolation.c @@ -86,18 +86,37 @@ #include "passt.h" #include "isolation.h" +#define CAP_VERSION _LINUX_CAPABILITY_VERSION_3 +#define CAP_WORDS _LINUX_CAPABILITY_U32S_3 + /** - * drop_caps() - Drop capabilities we might have except for CAP_NET_BIND_SERVICE + * drop_caps_ep_except() - Drop capabilities from effective & permitted sets + * @keep: Capabilities to keep */ -static void drop_caps(void) +static void drop_caps_ep_except(uint64_t keep) { + struct __user_cap_header_struct hdr = { + .version = CAP_VERSION, + .pid = 0, + }; + struct __user_cap_data_struct data[CAP_WORDS]; int i; - for (i = 0; i < 64; i++) { - if (i == CAP_NET_BIND_SERVICE) - continue; + if (syscall(SYS_capget, &hdr, data)) { + err("Couldn't get current capabilities: %s", strerror(errno)); + exit(EXIT_FAILURE); + } + + for (i = 0; i < CAP_WORDS; i++) { + uint32_t mask = keep >> (32 * i); + + data[i].effective &= mask; + data[i].permitted &= mask; + } - prctl(PR_CAPBSET_DROP, i, 0, 0, 0); + if (syscall(SYS_capset, &hdr, data)) { + err("Couldn't drop capabilities: %s", strerror(errno)); + exit(EXIT_FAILURE); } } @@ -111,7 +130,11 @@ static void drop_caps(void) */ void isolate_initial(void) { - drop_caps(); + /* We want to keep CAP_NET_BIND_SERVICE in the initial + * namespace if we have it, so that we can forward low ports + * into the guest/namespace + */ + drop_caps_ep_except((1UL << CAP_NET_BIND_SERVICE));You could use BIT() (util.h) here,-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson} /** @@ -211,6 +234,7 @@ void isolate_user(uid_t uid, gid_t gid, bool use_userns, const char *userns) int isolate_prefork(struct ctx *c) { int flags = CLONE_NEWIPC | CLONE_NEWNS | CLONE_NEWUTS; + uint64_t ns_caps = 0; /* If we run in foreground, we have no chance to actually move to a new * PID namespace. For passt, use CLONE_NEWPID anyway, in case somebody @@ -251,7 +275,19 @@ int isolate_prefork(struct ctx *c) return -errno; } - drop_caps(); /* Relative to the new user namespace this time. */ + /* Drop capabilites in our new userns */ + if (c->mode == MODE_PASTA) { + /* Keep CAP_SYS_ADMIN, so that we can setns() to the + * netns when we need to act upon it + */ + ns_caps |= 1UL << CAP_SYS_ADMIN;here,+ /* Keep CAP_NET_BIND_SERVICE, so we can splice + * outbound connections to low port numbers + */ + ns_caps |= 1UL << CAP_NET_BIND_SERVICE;and here.+ } + + drop_caps_ep_except(ns_caps); return 0; }