[PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address
Currently, outbound forwards (-T, -U) are handled by sockets bound to the
loopback address. Typically we create two sockets, one for 127.0.0.1 and
one for ::1.
This has some disadvantages:
* The guest can't connect to these services using its global IP address,
it must explicitly use 127.0.0.1 or ::1 (bug 100)
* The guest can't even connect via 127.0.0.0/8 addresses other than
127.0.0.1
* We can't use dual-stack sockets, we have to have separate sockets for
IPv4 and IPv6.
The restriction exist for a reason though. If the guest has any interfaces
other than pasta (e.g. a VPN tunnel) external hosts could reach the host
via the forwards. Especially combined with -T auto / -U auto this would
make it very easy to make a mistake with nasty security implications.
We can achieve both goals, however, if we don't bind the outbound listening
sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict
them to the "lo" interface.
Link: https://bugs.passt.top/show_bug.cgi?id=100
Signed-off-by: David Gibson
On Fri, 17 Oct 2025 11:34:47 +1100
David Gibson
Currently, outbound forwards (-T, -U) are handled by sockets bound to the loopback address. Typically we create two sockets, one for 127.0.0.1 and one for ::1.
This has some disadvantages: * The guest can't connect to these services using its global IP address, it must explicitly use 127.0.0.1 or ::1 (bug 100) * The guest can't even connect via 127.0.0.0/8 addresses other than 127.0.0.1 * We can't use dual-stack sockets, we have to have separate sockets for IPv4 and IPv6.
The restriction exist for a reason though. If the guest has any interfaces other than pasta (e.g. a VPN tunnel) external hosts could reach the host via the forwards. Especially combined with -T auto / -U auto this would make it very easy to make a mistake with nasty security implications.
We can achieve both goals, however, if we don't bind the outbound listening sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict them to the "lo" interface.
Nice trick, I didn't think of it. I wonder if doing the same host-side might help solving a part of https://bugs.passt.top/show_bug.cgi?id=113 as well.
Link: https://bugs.passt.top/show_bug.cgi?id=100
Signed-off-by: David Gibson
--- pif.c | 6 ------ tcp.c | 18 ++---------------- udp.c | 27 ++++++++++----------------- 3 files changed, 12 insertions(+), 39 deletions(-) diff --git a/pif.c b/pif.c index 592fafaa..84e3ceae 100644 --- a/pif.c +++ b/pif.c @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
ASSERT(pif_is_socket(pif));
- if (pif == PIF_SPLICE) { - /* Sanity checks */ - ASSERT(!ifname); - ASSERT(addr && inany_is_loopback(addr)); - } - if (!addr) return sock_l4_sa(c, type, &sa, sizeof(sa.sa6), ifname, false, data); diff --git a/tcp.c b/tcp.c index 15c012d7..982c9190 100644 --- a/tcp.c +++ b/tcp.c @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
return r4 < 0 ? r4 : r6; } -/** - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections - * @c: Execution context - * @port: Port, host order - */ -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port) -{ - ASSERT(!c->no_tcp); - - if (c->ifi4) - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port); - if (c->ifi6) - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port); -}
/** * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg) if (!bitmap_isset(c->tcp.fwd_out.map, port)) continue;
- tcp_ns_sock_init(c, port); + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
I thought the "lo" string would be part of the Linux UAPI, but that's not the case, and loopback_net_init() just calls: alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup); so I think it's relatively unproblematic to hardcode that as well, and it looks like we can't create a second loopback interface, even though: $ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name lol; ip link show lol' 1: lol: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 I don't have any quick solution and I don't think we care enough as to write a function in netlink.c fetching links with loopback type, so I'm totally fine with this as it is. By the way, if we fail to use SO_BINDTODEVICE, we already defensively close the socket. The only possible flaw that occurs to me is that somebody could rename 'lo' and then create a link called 'lo' of a different type. But that needs CAP_NET_ADMIN in the container anyway.
}
return 0; @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound) if ((c->ifi4 && socks[port][V4] == -1) || (c->ifi6 && socks[port][V6] == -1)) { if (outbound) - tcp_ns_sock_init(c, port); + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels?
else tcp_sock_init(c, PIF_HOST, NULL, NULL, port); } diff --git a/udp.c b/udp.c index 49dd0144..e38114eb 100644 --- a/udp.c +++ b/udp.c @@ -1127,26 +1127,16 @@ int udp_sock_init(const struct ctx *c, uint8_t pif, }
if ((!addr || inany_v4(addr)) && c->ifi4) { - const union inany_addr *a = addr ? - addr : &inany_any4; - - if (pif == PIF_SPLICE) - a = &inany_loopback4; - - r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, + r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, + addr ? addr : &inany_any4, ifname, port, uref.u32);
socks[V4][port] = r4 < 0 ? -1 : r4; }
if ((!addr || !inany_v4(addr)) && c->ifi6) { - const union inany_addr *a = addr ? - addr : &inany_any6; - - if (pif == PIF_SPLICE) - a = &inany_loopback6; - - r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, + r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, + addr ? addr : &inany_any6, ifname, port, uref.u32);
socks[V6][port] = r6 < 0 ? -1 : r6; @@ -1214,9 +1204,12 @@ static void udp_port_rebind(struct ctx *c, bool outbound) continue;
if ((c->ifi4 && socks[V4][port] == -1) || - (c->ifi6 && socks[V6][port] == -1)) - udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST, - NULL, NULL, port); + (c->ifi6 && socks[V6][port] == -1)) { + if (outbound) + udp_sock_init(c, PIF_SPLICE, NULL, "lo", port); + else + udp_sock_init(c, PIF_HOST, NULL, NULL, port);
Same here, should we add a fallback case? The rest of the series looks good to me.
+ } } }
-- Stefano
On Tue, Oct 21, 2025 at 11:51:12PM +0200, Stefano Brivio wrote:
On Fri, 17 Oct 2025 11:34:47 +1100 David Gibson
wrote: Currently, outbound forwards (-T, -U) are handled by sockets bound to the loopback address. Typically we create two sockets, one for 127.0.0.1 and one for ::1.
This has some disadvantages: * The guest can't connect to these services using its global IP address, it must explicitly use 127.0.0.1 or ::1 (bug 100) * The guest can't even connect via 127.0.0.0/8 addresses other than 127.0.0.1 * We can't use dual-stack sockets, we have to have separate sockets for IPv4 and IPv6.
The restriction exist for a reason though. If the guest has any interfaces other than pasta (e.g. a VPN tunnel) external hosts could reach the host via the forwards. Especially combined with -T auto / -U auto this would make it very easy to make a mistake with nasty security implications.
We can achieve both goals, however, if we don't bind the outbound listening sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict them to the "lo" interface.
Nice trick, I didn't think of it. I wonder if doing the same host-side might help solving a part of https://bugs.passt.top/show_bug.cgi?id=113 as well.
I don't think we even need to do anything host side - bug 113 arises because of where we're listening on the guest side. So this might be enough to fix it all on its own. I'm not certain, because the special case DNS handling complicates the picture there.
Link: https://bugs.passt.top/show_bug.cgi?id=100
Signed-off-by: David Gibson
--- pif.c | 6 ------ tcp.c | 18 ++---------------- udp.c | 27 ++++++++++----------------- 3 files changed, 12 insertions(+), 39 deletions(-) diff --git a/pif.c b/pif.c index 592fafaa..84e3ceae 100644 --- a/pif.c +++ b/pif.c @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
ASSERT(pif_is_socket(pif));
- if (pif == PIF_SPLICE) { - /* Sanity checks */ - ASSERT(!ifname); - ASSERT(addr && inany_is_loopback(addr)); - } - if (!addr) return sock_l4_sa(c, type, &sa, sizeof(sa.sa6), ifname, false, data); diff --git a/tcp.c b/tcp.c index 15c012d7..982c9190 100644 --- a/tcp.c +++ b/tcp.c @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
return r4 < 0 ? r4 : r6; } -/** - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections - * @c: Execution context - * @port: Port, host order - */ -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port) -{ - ASSERT(!c->no_tcp); - - if (c->ifi4) - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port); - if (c->ifi6) - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port); -}
/** * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg) if (!bitmap_isset(c->tcp.fwd_out.map, port)) continue;
- tcp_ns_sock_init(c, port); + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
I thought the "lo" string would be part of the Linux UAPI, but that's not the case, and loopback_net_init() just calls:
alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup);
so I think it's relatively unproblematic to hardcode that as well, and it looks like we can't create a second loopback interface, even though:
$ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name lol; ip link show lol' 1: lol: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Hm, that is a point. *thinks*. So I believe loopback always has index 1, so we could potentially use that. Except that BINDTODEVICE takes a name, not an index. I don't think looking up the name from the index then using BINDTODEVICE would do quite what we want either: IIUC, BINDTODEVICE is fixed by name not index, so if the guest changed the name of lo after we did BINDTODEVICE, then it wouldn't "follow" the interface name change (which is what you want for intermittently present interfaces, not so much here).
I don't have any quick solution and I don't think we care enough as to write a function in netlink.c fetching links with loopback type, so I'm totally fine with this as it is.
Yeah, given the above, I think this is another case of we can only go so far to stop the guest shooting itself in the foot.
By the way, if we fail to use SO_BINDTODEVICE, we already defensively close the socket. The only possible flaw that occurs to me is that somebody could rename 'lo' and then create a link called 'lo' of a different type. But that needs CAP_NET_ADMIN in the container anyway.
Right. And while that could expose host ports in ways we didn't expect, a malicious guest could already do that by running a port forwarder. So, again, I think this falls under the category of the guest being allowed to shoot itself in the foot.
}
return 0; @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound) if ((c->ifi4 && socks[port][V4] == -1) || (c->ifi6 && socks[port][V6] == -1)) { if (outbound) - tcp_ns_sock_init(c, port); + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels?
For a moment I thought we didn't need a fallback, because we'd be entering the guest ns and thereby gaining CAP_NET_RAW. But that's not the case: we only enter the guest netns for this operation, we're already in the userns and have dropped capabilities at this point (unlike the bindings we create at startup). So, good question. Having a fallback would make this a *lot* messier, and perhaps more importantly means we'd get a subtle but real behavioural difference based on kernel version which sounds like it could be pretty surprising to the user. My inclination is to say that -T auto / -U auto requires a kernel with that patch, but if you overrule me, I'll see what I can do.
else tcp_sock_init(c, PIF_HOST, NULL, NULL, port); } diff --git a/udp.c b/udp.c index 49dd0144..e38114eb 100644 --- a/udp.c +++ b/udp.c @@ -1127,26 +1127,16 @@ int udp_sock_init(const struct ctx *c, uint8_t pif, }
if ((!addr || inany_v4(addr)) && c->ifi4) { - const union inany_addr *a = addr ? - addr : &inany_any4; - - if (pif == PIF_SPLICE) - a = &inany_loopback4; - - r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, + r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, + addr ? addr : &inany_any4, ifname, port, uref.u32);
socks[V4][port] = r4 < 0 ? -1 : r4; }
if ((!addr || !inany_v4(addr)) && c->ifi6) { - const union inany_addr *a = addr ? - addr : &inany_any6; - - if (pif == PIF_SPLICE) - a = &inany_loopback6; - - r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, + r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, + addr ? addr : &inany_any6, ifname, port, uref.u32);
socks[V6][port] = r6 < 0 ? -1 : r6; @@ -1214,9 +1204,12 @@ static void udp_port_rebind(struct ctx *c, bool outbound) continue;
if ((c->ifi4 && socks[V4][port] == -1) || - (c->ifi6 && socks[V6][port] == -1)) - udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST, - NULL, NULL, port); + (c->ifi6 && socks[V6][port] == -1)) { + if (outbound) + udp_sock_init(c, PIF_SPLICE, NULL, "lo", port); + else + udp_sock_init(c, PIF_HOST, NULL, NULL, port);
Same here, should we add a fallback case? The rest of the series looks good to me.
Same comments as for TCP. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, 22 Oct 2025 11:34:40 +1100
David Gibson
On Tue, Oct 21, 2025 at 11:51:12PM +0200, Stefano Brivio wrote:
On Fri, 17 Oct 2025 11:34:47 +1100 David Gibson
wrote: Currently, outbound forwards (-T, -U) are handled by sockets bound to the loopback address. Typically we create two sockets, one for 127.0.0.1 and one for ::1.
This has some disadvantages: * The guest can't connect to these services using its global IP address, it must explicitly use 127.0.0.1 or ::1 (bug 100) * The guest can't even connect via 127.0.0.0/8 addresses other than 127.0.0.1 * We can't use dual-stack sockets, we have to have separate sockets for IPv4 and IPv6.
The restriction exist for a reason though. If the guest has any interfaces other than pasta (e.g. a VPN tunnel) external hosts could reach the host via the forwards. Especially combined with -T auto / -U auto this would make it very easy to make a mistake with nasty security implications.
We can achieve both goals, however, if we don't bind the outbound listening sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict them to the "lo" interface.
Nice trick, I didn't think of it. I wonder if doing the same host-side might help solving a part of https://bugs.passt.top/show_bug.cgi?id=113 as well.
I don't think we even need to do anything host side - bug 113 arises because of where we're listening on the guest side.
Ah, you're right, I guess I picked the wrong bug. I have a vague memory of another one where somebody is running a DNS proxy in a container (PiHole maybe?), bound to something in 127.0.0.0/8 but not 127.0.0.1, and we automatically bind, on the host side, to 127.0.0.1.
So this might be enough to fix it all on its own. I'm not certain, because the special case DNS handling complicates the picture there.
I guess perhaps worth a quick test with socat if checking against systemd-resolved isn't pratical, to see if we can close that one too?
Link: https://bugs.passt.top/show_bug.cgi?id=100
Signed-off-by: David Gibson
--- pif.c | 6 ------ tcp.c | 18 ++---------------- udp.c | 27 ++++++++++----------------- 3 files changed, 12 insertions(+), 39 deletions(-) diff --git a/pif.c b/pif.c index 592fafaa..84e3ceae 100644 --- a/pif.c +++ b/pif.c @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
ASSERT(pif_is_socket(pif));
- if (pif == PIF_SPLICE) { - /* Sanity checks */ - ASSERT(!ifname); - ASSERT(addr && inany_is_loopback(addr)); - } - if (!addr) return sock_l4_sa(c, type, &sa, sizeof(sa.sa6), ifname, false, data); diff --git a/tcp.c b/tcp.c index 15c012d7..982c9190 100644 --- a/tcp.c +++ b/tcp.c @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
return r4 < 0 ? r4 : r6; } -/** - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections - * @c: Execution context - * @port: Port, host order - */ -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port) -{ - ASSERT(!c->no_tcp); - - if (c->ifi4) - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port); - if (c->ifi6) - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port); -}
/** * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg) if (!bitmap_isset(c->tcp.fwd_out.map, port)) continue;
- tcp_ns_sock_init(c, port); + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
I thought the "lo" string would be part of the Linux UAPI, but that's not the case, and loopback_net_init() just calls:
alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup);
so I think it's relatively unproblematic to hardcode that as well, and it looks like we can't create a second loopback interface, even though:
$ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name lol; ip link show lol' 1: lol: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Hm, that is a point. *thinks*. So I believe loopback always has index 1, so we could potentially use that.
Right, see pasta_ns_conf(): rc = nl_link_set_flags(nl_sock_ns, 1 /* lo */, IFF_UP, IFF_UP);
Except that BINDTODEVICE takes a name, not an index. I don't think looking up the name from the index then using BINDTODEVICE would do quite what we want either: IIUC, BINDTODEVICE is fixed by name not index, so if the guest changed the name of lo after we did BINDTODEVICE, then it wouldn't "follow" the interface name change (which is what you want for intermittently present interfaces, not so much here).
Yes, we would need to keep querying it.
I don't have any quick solution and I don't think we care enough as to write a function in netlink.c fetching links with loopback type, so I'm totally fine with this as it is.
Yeah, given the above, I think this is another case of we can only go so far to stop the guest shooting itself in the foot.
By the way, if we fail to use SO_BINDTODEVICE, we already defensively close the socket. The only possible flaw that occurs to me is that somebody could rename 'lo' and then create a link called 'lo' of a different type. But that needs CAP_NET_ADMIN in the container anyway.
Right. And while that could expose host ports in ways we didn't expect, a malicious guest could already do that by running a port forwarder. So, again, I think this falls under the category of the guest being allowed to shoot itself in the foot.
Indeed.
}
return 0; @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound) if ((c->ifi4 && socks[port][V4] == -1) || (c->ifi6 && socks[port][V6] == -1)) { if (outbound) - tcp_ns_sock_init(c, port); + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels?
For a moment I thought we didn't need a fallback, because we'd be entering the guest ns and thereby gaining CAP_NET_RAW. But that's not the case: we only enter the guest netns for this operation, we're already in the userns and have dropped capabilities at this point (unlike the bindings we create at startup).
So, good question. Having a fallback would make this a *lot* messier,
Hmm, why? I didn't test this (and I'm quite confused by inany_from_sockaddr() right now) but: --- diff --git a/util.c b/util.c index c492f90..6b04040 100644 --- a/util.c +++ b/util.c @@ -126,17 +126,33 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type, * ("net: core: enable SO_BINDTODEVICE for non-root users"). If * it's unsupported, don't bind the socket at all, because the * user might rely on this to filter incoming connections. + * + * As an exception, if we want to bind to 'lo', approximate that + * by binding to 127.0.0.1 (which might be the wrong loopback + * address for IPv4) or ::1 (always correct, for IPv6). This + * adds https://bugs.passt.top/show_bug.cgi?id=100 back for + * pre-5.7 kernels. */ if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname))) { char str[SOCKADDR_STRLEN]; - ret = -errno; - warn("Can't bind %s socket for %s to %s, closing", - EPOLL_TYPE_STR(proto), - sockaddr_ntop(sa, str, sizeof(str)), ifname); - close(fd); - return ret; + if (errno == EPERM && !strcmp(ifname, "lo")) { + /* I just realised inany_from_sockaddr() doesn't + * actually take a sockaddr as source...? Do we + * have another function doing that now? + * + * Well, change the address in 'sa' and warn(). + */ + ; + } else { + ret = -errno; + warn("Can't bind %s socket for %s to %s, closing", + EPOLL_TYPE_STR(proto), + sockaddr_ntop(sa, str, sizeof(str)), ifname); + close(fd); + return ret; + } } } ---
and perhaps more importantly means we'd get a subtle but real behavioural difference based on kernel version which sounds like it could be pretty surprising to the user. My inclination is to say that -T auto / -U auto requires a kernel with that patch, but if you overrule me, I'll see what I can do.
See https://bugs.passt.top/show_bug.cgi?id=121 for a clear indication that we have users on 4.18 (RHEL 8) kernels. On Debian, Buster (4.19 kernels) was the 'oldoldstable' until recently, and extended long term support ends in 2029. If this was a new feature, I would agree. But with this, we are introducing a regression and we might break some setups.
else tcp_sock_init(c, PIF_HOST, NULL, NULL, port); } diff --git a/udp.c b/udp.c index 49dd0144..e38114eb 100644 --- a/udp.c +++ b/udp.c @@ -1127,26 +1127,16 @@ int udp_sock_init(const struct ctx *c, uint8_t pif, }
if ((!addr || inany_v4(addr)) && c->ifi4) { - const union inany_addr *a = addr ? - addr : &inany_any4; - - if (pif == PIF_SPLICE) - a = &inany_loopback4; - - r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, + r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, + addr ? addr : &inany_any4, ifname, port, uref.u32);
socks[V4][port] = r4 < 0 ? -1 : r4; }
if ((!addr || !inany_v4(addr)) && c->ifi6) { - const union inany_addr *a = addr ? - addr : &inany_any6; - - if (pif == PIF_SPLICE) - a = &inany_loopback6; - - r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, + r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, + addr ? addr : &inany_any6, ifname, port, uref.u32);
socks[V6][port] = r6 < 0 ? -1 : r6; @@ -1214,9 +1204,12 @@ static void udp_port_rebind(struct ctx *c, bool outbound) continue;
if ((c->ifi4 && socks[V4][port] == -1) || - (c->ifi6 && socks[V6][port] == -1)) - udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST, - NULL, NULL, port); + (c->ifi6 && socks[V6][port] == -1)) { + if (outbound) + udp_sock_init(c, PIF_SPLICE, NULL, "lo", port); + else + udp_sock_init(c, PIF_HOST, NULL, NULL, port);
Same here, should we add a fallback case? The rest of the series looks good to me.
Same comments as for TCP.
-- Stefano
On Wed, Oct 22, 2025 at 10:59:16AM +0200, Stefano Brivio wrote:
On Wed, 22 Oct 2025 11:34:40 +1100 David Gibson
wrote: On Tue, Oct 21, 2025 at 11:51:12PM +0200, Stefano Brivio wrote:
On Fri, 17 Oct 2025 11:34:47 +1100 David Gibson
wrote: Currently, outbound forwards (-T, -U) are handled by sockets bound to the loopback address. Typically we create two sockets, one for 127.0.0.1 and one for ::1.
This has some disadvantages: * The guest can't connect to these services using its global IP address, it must explicitly use 127.0.0.1 or ::1 (bug 100) * The guest can't even connect via 127.0.0.0/8 addresses other than 127.0.0.1 * We can't use dual-stack sockets, we have to have separate sockets for IPv4 and IPv6.
The restriction exist for a reason though. If the guest has any interfaces other than pasta (e.g. a VPN tunnel) external hosts could reach the host via the forwards. Especially combined with -T auto / -U auto this would make it very easy to make a mistake with nasty security implications.
We can achieve both goals, however, if we don't bind the outbound listening sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict them to the "lo" interface.
Nice trick, I didn't think of it. I wonder if doing the same host-side might help solving a part of https://bugs.passt.top/show_bug.cgi?id=113 as well.
I don't think we even need to do anything host side - bug 113 arises because of where we're listening on the guest side.
Ah, you're right, I guess I picked the wrong bug. I have a vague memory of another one where somebody is running a DNS proxy in a container (PiHole maybe?), bound to something in 127.0.0.0/8 but not 127.0.0.1, and we automatically bind, on the host side, to 127.0.0.1.
So this might be enough to fix it all on its own. I'm not certain, because the special case DNS handling complicates the picture there.
I guess perhaps worth a quick test with socat if checking against systemd-resolved isn't pratical, to see if we can close that one too?
Right, I'll have a look when I can.
Link: https://bugs.passt.top/show_bug.cgi?id=100
Signed-off-by: David Gibson
--- pif.c | 6 ------ tcp.c | 18 ++---------------- udp.c | 27 ++++++++++----------------- 3 files changed, 12 insertions(+), 39 deletions(-) diff --git a/pif.c b/pif.c index 592fafaa..84e3ceae 100644 --- a/pif.c +++ b/pif.c @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
ASSERT(pif_is_socket(pif));
- if (pif == PIF_SPLICE) { - /* Sanity checks */ - ASSERT(!ifname); - ASSERT(addr && inany_is_loopback(addr)); - } - if (!addr) return sock_l4_sa(c, type, &sa, sizeof(sa.sa6), ifname, false, data); diff --git a/tcp.c b/tcp.c index 15c012d7..982c9190 100644 --- a/tcp.c +++ b/tcp.c @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
return r4 < 0 ? r4 : r6; } -/** - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections - * @c: Execution context - * @port: Port, host order - */ -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port) -{ - ASSERT(!c->no_tcp); - - if (c->ifi4) - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port); - if (c->ifi6) - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port); -}
/** * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg) if (!bitmap_isset(c->tcp.fwd_out.map, port)) continue;
- tcp_ns_sock_init(c, port); + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
I thought the "lo" string would be part of the Linux UAPI, but that's not the case, and loopback_net_init() just calls:
alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup);
so I think it's relatively unproblematic to hardcode that as well, and it looks like we can't create a second loopback interface, even though:
$ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name lol; ip link show lol' 1: lol: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Hm, that is a point. *thinks*. So I believe loopback always has index 1, so we could potentially use that.
Right, see pasta_ns_conf():
rc = nl_link_set_flags(nl_sock_ns, 1 /* lo */, IFF_UP, IFF_UP);
Except that BINDTODEVICE takes a name, not an index. I don't think looking up the name from the index then using BINDTODEVICE would do quite what we want either: IIUC, BINDTODEVICE is fixed by name not index, so if the guest changed the name of lo after we did BINDTODEVICE, then it wouldn't "follow" the interface name change (which is what you want for intermittently present interfaces, not so much here).
Yes, we would need to keep querying it.
I don't have any quick solution and I don't think we care enough as to write a function in netlink.c fetching links with loopback type, so I'm totally fine with this as it is.
Yeah, given the above, I think this is another case of we can only go so far to stop the guest shooting itself in the foot.
By the way, if we fail to use SO_BINDTODEVICE, we already defensively close the socket. The only possible flaw that occurs to me is that somebody could rename 'lo' and then create a link called 'lo' of a different type. But that needs CAP_NET_ADMIN in the container anyway.
Right. And while that could expose host ports in ways we didn't expect, a malicious guest could already do that by running a port forwarder. So, again, I think this falls under the category of the guest being allowed to shoot itself in the foot.
Indeed.
}
return 0; @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound) if ((c->ifi4 && socks[port][V4] == -1) || (c->ifi6 && socks[port][V6] == -1)) { if (outbound) - tcp_ns_sock_init(c, port); + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels?
For a moment I thought we didn't need a fallback, because we'd be entering the guest ns and thereby gaining CAP_NET_RAW. But that's not the case: we only enter the guest netns for this operation, we're already in the userns and have dropped capabilities at this point (unlike the bindings we create at startup).
So, good question. Having a fallback would make this a *lot* messier,
Hmm, why? I didn't test this (and I'm quite confused by inany_from_sockaddr() right now) but:
Ah, I didn't think of doing the workaround at this level. That does help somewhat.
--- diff --git a/util.c b/util.c index c492f90..6b04040 100644 --- a/util.c +++ b/util.c @@ -126,17 +126,33 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type, * ("net: core: enable SO_BINDTODEVICE for non-root users"). If * it's unsupported, don't bind the socket at all, because the * user might rely on this to filter incoming connections. + * + * As an exception, if we want to bind to 'lo', approximate that + * by binding to 127.0.0.1 (which might be the wrong loopback + * address for IPv4) or ::1 (always correct, for IPv6). This + * adds https://bugs.passt.top/show_bug.cgi?id=100 back for + * pre-5.7 kernels. */ if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname))) { char str[SOCKADDR_STRLEN];
- ret = -errno; - warn("Can't bind %s socket for %s to %s, closing", - EPOLL_TYPE_STR(proto), - sockaddr_ntop(sa, str, sizeof(str)), ifname); - close(fd); - return ret; + if (errno == EPERM && !strcmp(ifname, "lo")) { + /* I just realised inany_from_sockaddr() doesn't + * actually take a sockaddr as source...? Do we
It does, the description's just not great. It takes the sockaddr as a (void *) @addr so that callers can pass a struct sockaddr_whatever without casting. I'll try to improve this.
+ * have another function doing that now? + * + * Well, change the address in 'sa' and warn().
We need to specially handle the dual stack case, to force the caller to split into separate v4 and v6 sockets here.
+ */ + ; + } else { + ret = -errno; + warn("Can't bind %s socket for %s to %s, closing", + EPOLL_TYPE_STR(proto), + sockaddr_ntop(sa, str, sizeof(str)), ifname); + close(fd); + return ret; + } } }
---
and perhaps more importantly means we'd get a subtle but real behavioural difference based on kernel version which sounds like it could be pretty surprising to the user. My inclination is to say that -T auto / -U auto requires a kernel with that patch, but if you overrule me, I'll see what I can do.
See https://bugs.passt.top/show_bug.cgi?id=121 for a clear indication that we have users on 4.18 (RHEL 8) kernels. On Debian, Buster (4.19 kernels) was the 'oldoldstable' until recently, and extended long term support ends in 2029.
Poop.
If this was a new feature, I would agree. But with this, we are introducing a regression and we might break some setups.
Yeah, ok. I'll figure something out. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
participants (2)
-
David Gibson
-
Stefano Brivio