[PATCH v3 0/8] Reduce differences between inbound and outbound socket binding

newer
[PATCH 1/3] Revert "fwd: Update...

David Gibson

29 Oct 2025 29 Oct '25

7:26 a.m.

The fact that outbound forwarding sockets are bound to the loopback address, whereas inbound forwarding sockets are (by default) bound to the unspecified address leads to some unexpected differences between the paths setting up each of them. An idea for tackling bug 100 suggested a different approach which will also reduce some of those differences and allow more code to be shared between the two paths. I've since discovered that this approach may not help for bug 100, but it might still be worthwhile for the clean up. Patches 1..6/8 are cleanups which shouldn't change behaviour, and I think are ready to merge. 7/8 is (arguably) a behavioural change, but I've made my case for it in the patch comment. 8/8 needs some further consideration, since I've discovered it does not fix bug 100 as is, I'm including it for advance review, though. v3: - A number of additional fixes covering the handling of IPV6_V6ONLY sockopt - Assorted trivial changes v2: - Some rearrangements and rewordings for clarity David Gibson (8): inany: Let length of sockaddr_inany be implicit from the family util, flow, pif: Simplify sock_l4_sa() interface tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() udp: Unify some more inbound/outbound parts of udp_sock_init() udp: Move udp_sock_init() special case to its caller util: Fix setting of IPV6_V6ONLY socket option tcp, udp: Remove fallback if creating dual stack socket fails [RFC, DO NOT APPLY] tcp, udp: Bind outbound listening sockets by interface instead of address conf.c | 4 +- flow.c | 19 +++--- icmp.c | 3 +- inany.h | 18 ++++++ pif.c | 26 ++------ pif.h | 2 +- tcp.c | 164 +++++++++++++++++++-------------------------------- tcp.h | 5 +- tcp_splice.c | 5 +- udp.c | 102 +++++++++++++++----------------- udp.h | 5 +- util.c | 55 +++++++++++++---- util.h | 9 ++- 13 files changed, 201 insertions(+), 216 deletions(-) -- 2.51.0

Show replies by date

David Gibson

29 Oct 29 Oct

7:26 a.m.

New subject: [PATCH v3 4/8] udp: Unify some more inbound/outbound parts of udp_sock_init()

udp_sock_init() takes an 'ns' parameter determining if it creates a socket in the guest namespace or host namespace. Alter it to take a pif parameter instead, like tcp_sock_init(), and use that change to slightly reduce code duplication between the HOST and SPLICE cases. Signed-off-by: David Gibson --- conf.c | 2 +- udp.c | 65 +++++++++++++++++++++++++++++++--------------------------- udp.h | 5 +++-- 3 files changed, 39 insertions(+), 33 deletions(-) diff --git a/conf.c b/conf.c index 26f1bcc0..08cb50aa 100644 --- a/conf.c +++ b/conf.c @@ -171,7 +171,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname, if (optname == 't') ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i); else if (optname == 'u') - ret = udp_sock_init(c, 0, addr, ifname, i); + ret = udp_sock_init(c, PIF_HOST, addr, ifname, i); else /* No way to check in advance for -T and -U */ ret = 0; diff --git a/udp.c b/udp.c index 8f96495d..f3237436 100644 --- a/udp.c +++ b/udp.c @@ -1091,64 +1091,68 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, /** * udp_sock_init() - Initialise listening sockets for a given port * @c: Execution context - * @ns: In pasta mode, if set, bind with loopback address in namespace + * @pif: Interface to open the socket for (PIF_HOST or PIF_SPLICE) * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order * * Return: 0 on (partial) success, negative error code on (complete) failure */ -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr, - const char *ifname, in_port_t port) +int udp_sock_init(const struct ctx *c, uint8_t pif, + const union inany_addr *addr, const char *ifname, + in_port_t port) { union udp_listen_epoll_ref uref = { - .pif = ns ? PIF_SPLICE : PIF_HOST, + .pif = pif, .port = port, }; int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1; + int (*socks)[NUM_PORTS]; ASSERT(!c->no_udp); + ASSERT(pif_is_socket(pif)); - if (!addr && c->ifi4 && c->ifi6 && !ns) { + if (pif == PIF_HOST) + socks = udp_splice_init; + else + socks = udp_splice_ns; + + if (!addr && c->ifi4 && c->ifi6 && pif == PIF_HOST) { int s; /* Attempt to get a dual stack socket */ s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST, NULL, ifname, port, uref.u32); - udp_splice_init[V4][port] = s < 0 ? -1 : s; - udp_splice_init[V6][port] = s < 0 ? -1 : s; + socks[V4][port] = s < 0 ? -1 : s; + socks[V6][port] = s < 0 ? -1 : s; if (IN_INTERVAL(0, FD_REF_MAX, s)) return 0; } if ((!addr || inany_v4(addr)) && c->ifi4) { - if (!ns) { - r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST, - addr ? addr : &inany_any4, ifname, - port, uref.u32); + const union inany_addr *a = addr ? + addr : &inany_any4; - udp_splice_init[V4][port] = r4 < 0 ? -1 : r4; - } else { - r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE, - &inany_loopback4, ifname, - port, uref.u32); - udp_splice_ns[V4][port] = r4 < 0 ? -1 : r4; - } + if (pif == PIF_SPLICE) + a = &inany_loopback4; + + r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, + port, uref.u32); + + socks[V4][port] = r4 < 0 ? -1 : r4; } if ((!addr || !inany_v4(addr)) && c->ifi6) { - if (!ns) { - r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST, - addr ? addr : &inany_any6, ifname, - port, uref.u32); + const union inany_addr *a = addr ? + addr : &inany_any6; - udp_splice_init[V6][port] = r6 < 0 ? -1 : r6; - } else { - r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE, - &inany_loopback6, ifname, - port, uref.u32); - udp_splice_ns[V6][port] = r6 < 0 ? -1 : r6; - } + if (pif == PIF_SPLICE) + a = &inany_loopback6; + + r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, + port, uref.u32); + + socks[V6][port] = r6 < 0 ? -1 : r6; } if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6)) @@ -1214,7 +1218,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound) if ((c->ifi4 && socks[V4][port] == -1) || (c->ifi6 && socks[V6][port] == -1)) - udp_sock_init(c, outbound, NULL, NULL, port); + udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST, + NULL, NULL, port); } } diff --git a/udp.h b/udp.h index 8f8531ad..f78dc528 100644 --- a/udp.h +++ b/udp.h @@ -17,8 +17,9 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, uint8_t ttl, const struct pool *p, int idx, const struct timespec *now); -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr, - const char *ifname, in_port_t port); +int udp_sock_init(const struct ctx *c, uint8_t pif, + const union inany_addr *addr, const char *ifname, + in_port_t port); int udp_init(struct ctx *c); void udp_timer(struct ctx *c, const struct timespec *now); void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s); -- 2.51.0

David Gibson

7:26 a.m.

New subject: [PATCH v3 7/8] tcp, udp: Remove fallback if creating dual stack socket fails

To save kernel memory we try to use "dual stack" sockets which can listen for both IPv4 and IPv6 connections where possible. To support kernels which don't allow dual stack sockets, we fall back to creating individual sockets for IPv4 and IPv6. This fallback causes some mild ugliness now, and will cause more difficulty with upcoming improvements to the forwarding logic. I don't think we need the fallback on the following grounds: 1) The fallback was broken since inception: The fallback was triggered if pif_sock_l4() failed attempting to create the dual stack socket. But even if the kernel didn't suppor them, pif_sock_l4() would not report a failure. - Dual stack sockets are distinguished by having the IPV6_V6ONLY sockopt set to 0. However, until the last patch, we only called setsockopt() if we wanted to set this to 1, so there was no kernel operation which could fail for dual stack sockets - we'd silently create a IPv6 only socket instead. - Even if we did call the setsockopt(), we only printed a debug() message for failures, we didn't report it to the caller 2) Dual stack sockets are not just a Linux extension The dual stack socket interface is described in RFC3493, specifically section 3.7 and section 5.3. It is supported on BSD: https://man.freebsd.org/cgi/man.cgi?query=ip6 and on Windows: https://learn.microsoft.com/en-us/windows/win32/winsock/ipproto-ipv6-socket-... 3) Linux has supported dual stack sockets for over 20 years According to ipv6(7) the IPV6_V6ONLY socket option was introduced in Linux 2.6 and Linux 2.4.21 both from 2003. Signed-off-by: David Gibson --- tcp.c | 43 +++++++++++++++++++++++++------------------ udp.c | 58 ++++++++++++++++++++++++++-------------------------------- 2 files changed, 51 insertions(+), 50 deletions(-) diff --git a/tcp.c b/tcp.c index 5f4fc12c..a19f0ba9 100644 --- a/tcp.c +++ b/tcp.c @@ -2559,42 +2559,49 @@ static int tcp_sock_init_one(const struct ctx *c, uint8_t pif, } /** - * tcp_sock_init() - Create listening sockets for a given host ("inbound") port + * tcp_sock_init() - Create listening socket for a given host ("inbound") port * @c: Execution context * @pif: Interface to open the socket for (PIF_HOST or PIF_SPLICE) * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order * - * Return: 0 on (partial) success, negative error code on (complete) failure + * Return: 0 on success, negative error code on failure */ int tcp_sock_init(const struct ctx *c, uint8_t pif, const union inany_addr *addr, const char *ifname, in_port_t port) { - int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1; + int s; ASSERT(!c->no_tcp); - if (!addr && c->ifi4 && c->ifi6) - /* Attempt to get a dual stack socket */ - if (tcp_sock_init_one(c, pif, NULL, ifname, port) >= 0) + if (!c->ifi4) { + if (!addr) + /* Restrict to v6 only */ + addr = &inany_any6; + else if (inany_v4(addr)) + /* Nothing to do */ return 0; + } + if (!c->ifi6) { + if (!addr) + /* Restrict to v4 only */ + addr = &inany_any4; + else if (!inany_v4(addr)) + /* Nothing to do */ + return 0; + } - /* Otherwise create a socket per IP version */ - if ((!addr || inany_v4(addr)) && c->ifi4) - r4 = tcp_sock_init_one(c, pif, - addr ? addr : &inany_any4, ifname, port); - - if ((!addr || !inany_v4(addr)) && c->ifi6) - r6 = tcp_sock_init_one(c, pif, - addr ? addr : &inany_any6, ifname, port); - - if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6)) - return 0; + s = tcp_sock_init_one(c, pif, addr, ifname, port); + if (s < 0) + return s; + if (s > FD_REF_MAX) + return -EIO; - return r4 < 0 ? r4 : r6; + return 0; } + /** * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections * @c: Execution context diff --git a/udp.c b/udp.c index ee8334fc..5754c436 100644 --- a/udp.c +++ b/udp.c @@ -1089,14 +1089,14 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, } /** - * udp_sock_init() - Initialise listening sockets for a given port + * udp_sock_init() - Initialise listening socket for a given port * @c: Execution context * @pif: Interface to open the socket for (PIF_HOST or PIF_SPLICE) * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order * - * Return: 0 on (partial) success, negative error code on (complete) failure + * Return: 0 on success, negative error code on failure */ int udp_sock_init(const struct ctx *c, uint8_t pif, const union inany_addr *addr, const char *ifname, @@ -1106,8 +1106,8 @@ int udp_sock_init(const struct ctx *c, uint8_t pif, .pif = pif, .port = port, }; - int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1; int (*socks)[NUM_PORTS]; + int s; ASSERT(!c->no_udp); ASSERT(pif_is_socket(pif)); @@ -1117,42 +1117,36 @@ int udp_sock_init(const struct ctx *c, uint8_t pif, else socks = udp_splice_ns; - if (!addr && c->ifi4 && c->ifi6) { - int s; - - /* Attempt to get a dual stack socket */ - s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST, - NULL, ifname, port, uref.u32); - socks[V4][port] = s < 0 ? -1 : s; - socks[V6][port] = s < 0 ? -1 : s; - if (IN_INTERVAL(0, FD_REF_MAX, s)) + if (!c->ifi4) { + if (!addr) + /* Restrict to v6 only */ + addr = &inany_any6; + else if (inany_v4(addr)) + /* Nothing to do */ return 0; } - - if ((!addr || inany_v4(addr)) && c->ifi4) { - const union inany_addr *a = addr ? - addr : &inany_any4; - - r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, - port, uref.u32); - - socks[V4][port] = r4 < 0 ? -1 : r4; + if (!c->ifi6) { + if (!addr) + /* Restrict to v4 only */ + addr = &inany_any4; + else if (!inany_v4(addr)) + /* Nothing to do */ + return 0; } - if ((!addr || !inany_v4(addr)) && c->ifi6) { - const union inany_addr *a = addr ? - addr : &inany_any6; - - r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, - port, uref.u32); - - socks[V6][port] = r6 < 0 ? -1 : r6; + s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, + addr, ifname, port, uref.u32); + if (s > FD_REF_MAX) { + close(s); + s = -EIO; } - if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6)) - return 0; + if (!addr || inany_v4(addr)) + socks[V4][port] = s < 0 ? -1 : s; + if (!addr || !inany_v4(addr)) + socks[V6][port] = s < 0 ? -1 : s; - return r4 < 0 ? r4 : r6; + return s < 0 ? s : 0; } /** -- 2.51.0

David Gibson

7:26 a.m.

New subject: [PATCH v3 5/8] udp: Move udp_sock_init() special case to its caller

Inbound sockets are bound to the unspecified address by default, whereas outbound sockets are bound to the loopback address. This is currently handled in udp_sock_init() which ignores its addr argument in the outbound case. Move the handling of this special case to the caller, udp_port_rebind(). This makes the semantics of the 'addr' argument more consistent, and will make future changes easier. Signed-off-by: David Gibson --- udp.c | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/udp.c b/udp.c index f3237436..ee8334fc 100644 --- a/udp.c +++ b/udp.c @@ -1117,7 +1117,7 @@ int udp_sock_init(const struct ctx *c, uint8_t pif, else socks = udp_splice_ns; - if (!addr && c->ifi4 && c->ifi6 && pif == PIF_HOST) { + if (!addr && c->ifi4 && c->ifi6) { int s; /* Attempt to get a dual stack socket */ @@ -1133,9 +1133,6 @@ int udp_sock_init(const struct ctx *c, uint8_t pif, const union inany_addr *a = addr ? addr : &inany_any4; - if (pif == PIF_SPLICE) - a = &inany_loopback4; - r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, port, uref.u32); @@ -1146,9 +1143,6 @@ int udp_sock_init(const struct ctx *c, uint8_t pif, const union inany_addr *a = addr ? addr : &inany_any6; - if (pif == PIF_SPLICE) - a = &inany_loopback6; - r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, port, uref.u32); @@ -1217,9 +1211,16 @@ static void udp_port_rebind(struct ctx *c, bool outbound) continue; if ((c->ifi4 && socks[V4][port] == -1) || - (c->ifi6 && socks[V6][port] == -1)) - udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST, - NULL, NULL, port); + (c->ifi6 && socks[V6][port] == -1)) { + if (outbound) { + udp_sock_init(c, PIF_SPLICE, + &inany_loopback4, NULL, port); + udp_sock_init(c, PIF_SPLICE, + &inany_loopback6, NULL, port); + } else { + udp_sock_init(c, PIF_HOST, NULL, NULL, port); + } + } } } -- 2.51.0

David Gibson

7:26 a.m.

New subject: [PATCH v3 6/8] util: Fix setting of IPV6_V6ONLY socket option

Currently we only call setsockopt() on IPV6_V6ONLY when we want to set it to 1, which we typically do on all IPv6 sockets except those explicitly for dual stack listening. That's not quite right in two ways: * Although IPV6_V6ONLY==0 is normally the default on Linux, that can be changed with the net.ipv6.bindv6only sysctl. It may also have different defaults on other OSes if we ever support them. We know we need it off for dual stack sockets, so explicitly set it to 0 in that case. * At the same time setting IPV6_V6ONLY to 1 for IPv6 sockets bound to a specific address is harmless but pointless. Don't set the option at all in this case, saving a syscall. Signed-off-by: David Gibson --- util.c | 27 +++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/util.c b/util.c index c94efae4..62f43895 100644 --- a/util.c +++ b/util.c @@ -45,14 +45,14 @@ * @type: epoll type * @sa: Socket address to bind to * @ifname: Interface for binding, NULL for any - * @v6only: Set IPV6_V6ONLY socket option + * @v6only: If >= 0, set IPV6_V6ONLY socket option to this value * @data: epoll reference portion for protocol handlers * * Return: newly created socket, negative error code on failure */ static int sock_l4_(const struct ctx *c, enum epoll_type type, const union sockaddr_inany *sa, const char *ifname, - bool v6only, uint32_t data) + int v6only, uint32_t data) { sa_family_t af = sa->sa_family; union epoll_ref ref = { .type = type, .data = data }; @@ -101,9 +101,11 @@ static int sock_l4_(const struct ctx *c, enum epoll_type type, ref.fd = fd; - if (v6only) - if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y))) - debug("Failed to set IPV6_V6ONLY on socket %i", fd); + if (v6only >= 0) + if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, + &v6only, sizeof(v6only))) + debug("Failed to set IPV6_V6ONLY to %d on socket %i", + v6only, fd); if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y))) debug("Failed to set SO_REUSEADDR on socket %i", fd); @@ -186,7 +188,16 @@ int sock_l4(const struct ctx *c, enum epoll_type type, const union sockaddr_inany *sa, const char *ifname, uint32_t data) { - return sock_l4_(c, type, sa, ifname, sa->sa_family == AF_INET6, data); + int v6only = -1; + + /* The option doesn't exist for IPv4 sockets, and is irrelevant for IPv6 + * sockets with a non-wildcard address. + */ + if (sa->sa_family == AF_INET6 && + IN6_IS_ADDR_UNSPECIFIED(&sa->sa6.sin6_addr)) + v6only = 1; + + return sock_l4_(c, type, sa, ifname, v6only, data); } int sock_l4_dualstack(const struct ctx *c, enum epoll_type type, @@ -198,6 +209,10 @@ int sock_l4_dualstack(const struct ctx *c, enum epoll_type type, .sa6.sin6_port = htons(port), }; + /* Dual stack sockets require IPV6_V6ONLY == 0. Usually that's the + * default, but sysctl net.ipv6.bindv6only can change that, so set the + * sockopt explicitly. + */ return sock_l4_(c, type, &sa, ifname, 0, data); } -- 2.51.0

David Gibson

7:26 a.m.

New subject: [PATCH v3 8/8] [RFC, DO NOT APPLY] tcp, udp: Bind outbound listening sockets by interface instead of address

NOTE: I've discovered this approach doesn't work to address bug 100, although it may still be worthwhile for secondary reasons. Currently, outbound forwards (-T, -U) are handled by sockets bound to the loopback address. Typically we create two sockets, one for 127.0.0.1 and one for ::1. This has some disadvantages: * The guest can't connect to these services using its global IP address, it must explicitly use 127.0.0.1 or ::1 (bug 100) * The guest can't even connect via 127.0.0.0/8 addresses other than 127.0.0.1 * We can't use dual-stack sockets, we have to have separate sockets for IPv4 and IPv6. The restriction exist for a reason though. If the guest has any interfaces other than pasta (e.g. a VPN tunnel) external hosts could reach the host via the forwards. Especially combined with -T auto / -U auto this would make it very easy to make a mistake with nasty security implications. We can achieve both goals, however, if we don't bind the outbound listening sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict them to the "lo" interface. Link: https://bugs.passt.top/show_bug.cgi?id=100 Signed-off-by: David Gibson --- pif.c | 6 ------ tcp.c | 19 ++----------------- udp.c | 10 +++------- 3 files changed, 5 insertions(+), 30 deletions(-) diff --git a/pif.c b/pif.c index 5fb1f455..20b5cb99 100644 --- a/pif.c +++ b/pif.c @@ -79,12 +79,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, ASSERT(pif_is_socket(pif)); - if (pif == PIF_SPLICE) { - /* Sanity checks */ - ASSERT(!ifname); - ASSERT(addr && inany_is_loopback(addr)); - } - if (!addr) return sock_l4_dualstack(c, type, port, ifname, data); diff --git a/tcp.c b/tcp.c index a19f0ba9..841c60fa 100644 --- a/tcp.c +++ b/tcp.c @@ -2602,21 +2602,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif, return 0; } -/** - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections - * @c: Execution context - * @port: Port, host order - */ -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port) -{ - ASSERT(!c->no_tcp); - - if (c->ifi4) - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port); - if (c->ifi6) - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port); -} - /** * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections * @arg: Execution context @@ -2635,7 +2620,7 @@ static int tcp_ns_socks_init(void *arg) if (!bitmap_isset(c->tcp.fwd_out.map, port)) continue; - tcp_ns_sock_init(c, port); + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port); } return 0; @@ -2815,7 +2800,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound) if ((c->ifi4 && socks[port][V4] == -1) || (c->ifi6 && socks[port][V6] == -1)) { if (outbound) - tcp_ns_sock_init(c, port); + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port); else tcp_sock_init(c, PIF_HOST, NULL, NULL, port); } diff --git a/udp.c b/udp.c index 5754c436..4f796077 100644 --- a/udp.c +++ b/udp.c @@ -1206,14 +1206,10 @@ static void udp_port_rebind(struct ctx *c, bool outbound) if ((c->ifi4 && socks[V4][port] == -1) || (c->ifi6 && socks[V6][port] == -1)) { - if (outbound) { - udp_sock_init(c, PIF_SPLICE, - &inany_loopback4, NULL, port); - udp_sock_init(c, PIF_SPLICE, - &inany_loopback6, NULL, port); - } else { + if (outbound) + udp_sock_init(c, PIF_SPLICE, NULL, "lo", port); + else udp_sock_init(c, PIF_HOST, NULL, NULL, port); - } } } } -- 2.51.0

David Gibson

30 Oct 30 Oct

4:58 a.m.

On Wed, Oct 29, 2025 at 05:26:20PM +1100, David Gibson wrote:

...

The fact that outbound forwarding sockets are bound to the loopback address, whereas inbound forwarding sockets are (by default) bound to the unspecified address leads to some unexpected differences between the paths setting up each of them.

An idea for tackling bug 100 suggested a different approach which will also reduce some of those differences and allow more code to be shared between the two paths. I've since discovered that this approach may not help for bug 100, but it might still be worthwhile for the clean up.

Patches 1..6/8 are cleanups which shouldn't change behaviour, and I think are ready to merge. 7/8 is (arguably) a behavioural change, but I've made my case for it in the patch comment. 8/8 needs some further consideration, since I've discovered it does not fix bug 100 as is, I'm including it for advance review, though.

Follow up on 8/8 now I've looked into SO_BINDTODEVICE semantics more clearly. Unfortunately this approach won't work to fix bug 100 [0]. I still think the change is useful for the internals and it may also help with some other bugs such as 113. So, I don't think the code of 8/8 needs to change, but the comments and commit message do. Here's my plan: - I'll wait for review on this draft - If it needs a respin, I'll make those doc changes in the next version - If the series doesn't need a respin, go ahead and apply 1..7, and I'll send a revised 8/8 separately. [0] https://bugs.passt.top/show_bug.cgi?id=100#c8 -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

13 Nov 13 Nov

7:33 a.m.

New subject: [PATCH v3 4/8] udp: Unify some more inbound/outbound parts of udp_sock_init()

On Wed, 29 Oct 2025 17:26:24 +1100 David Gibson wrote:

...

udp_sock_init() takes an 'ns' parameter determining if it creates a socket in the guest namespace or host namespace. Alter it to take a pif parameter instead, like tcp_sock_init(), and use that change to slightly reduce code duplication between the HOST and SPLICE cases.

Signed-off-by: David Gibson --- conf.c | 2 +- udp.c | 65 +++++++++++++++++++++++++++++++--------------------------- udp.h | 5 +++-- 3 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/conf.c b/conf.c index 26f1bcc0..08cb50aa 100644 --- a/conf.c +++ b/conf.c @@ -171,7 +171,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname, if (optname == 't') ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i); else if (optname == 'u') - ret = udp_sock_init(c, 0, addr, ifname, i); + ret = udp_sock_init(c, PIF_HOST, addr, ifname, i); else /* No way to check in advance for -T and -U */ ret = 0; diff --git a/udp.c b/udp.c index 8f96495d..f3237436 100644 --- a/udp.c +++ b/udp.c @@ -1091,64 +1091,68 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, /** * udp_sock_init() - Initialise listening sockets for a given port * @c: Execution context - * @ns: In pasta mode, if set, bind with loopback address in namespace + * @pif: Interface to open the socket for (PIF_HOST or PIF_SPLICE) * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order * * Return: 0 on (partial) success, negative error code on (complete) failure */ -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr, - const char *ifname, in_port_t port) +int udp_sock_init(const struct ctx *c, uint8_t pif, + const union inany_addr *addr, const char *ifname, + in_port_t port) { union udp_listen_epoll_ref uref = { - .pif = ns ? PIF_SPLICE : PIF_HOST, + .pif = pif, .port = port, }; int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1; + int (*socks)[NUM_PORTS];

ASSERT(!c->no_udp); + ASSERT(pif_is_socket(pif));

- if (!addr && c->ifi4 && c->ifi6 && !ns) { + if (pif == PIF_HOST) + socks = udp_splice_init; + else + socks = udp_splice_ns; + + if (!addr && c->ifi4 && c->ifi6 && pif == PIF_HOST) { int s;

/* Attempt to get a dual stack socket */ s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST, NULL, ifname, port, uref.u32); - udp_splice_init[V4][port] = s < 0 ? -1 : s; - udp_splice_init[V6][port] = s < 0 ? -1 : s; + socks[V4][port] = s < 0 ? -1 : s; + socks[V6][port] = s < 0 ? -1 : s; if (IN_INTERVAL(0, FD_REF_MAX, s)) return 0; }

if ((!addr || inany_v4(addr)) && c->ifi4) { - if (!ns) { - r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST, - addr ? addr : &inany_any4, ifname, - port, uref.u32); + const union inany_addr *a = addr ? + addr : &inany_any4;

Nit: this fits on a single line.

...

- udp_splice_init[V4][port] = r4 < 0 ? -1 : r4; - } else { - r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE, - &inany_loopback4, ifname, - port, uref.u32); - udp_splice_ns[V4][port] = r4 < 0 ? -1 : r4; - } + if (pif == PIF_SPLICE) + a = &inany_loopback4; + + r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, + port, uref.u32); + + socks[V4][port] = r4 < 0 ? -1 : r4; }

if ((!addr || !inany_v4(addr)) && c->ifi6) { - if (!ns) { - r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST, - addr ? addr : &inany_any6, ifname, - port, uref.u32); + const union inany_addr *a = addr ? + addr : &inany_any6;

...same here.

...

- udp_splice_init[V6][port] = r6 < 0 ? -1 : r6; - } else { - r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE, - &inany_loopback6, ifname, - port, uref.u32); - udp_splice_ns[V6][port] = r6 < 0 ? -1 : r6; - } + if (pif == PIF_SPLICE) + a = &inany_loopback6; + + r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, + port, uref.u32); + + socks[V6][port] = r6 < 0 ? -1 : r6; }

if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6)) @@ -1214,7 +1218,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound)

if ((c->ifi4 && socks[V4][port] == -1) || (c->ifi6 && socks[V6][port] == -1)) - udp_sock_init(c, outbound, NULL, NULL, port); + udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST, + NULL, NULL, port); } }

diff --git a/udp.h b/udp.h index 8f8531ad..f78dc528 100644 --- a/udp.h +++ b/udp.h @@ -17,8 +17,9 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, uint8_t ttl, const struct pool *p, int idx, const struct timespec *now); -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr, - const char *ifname, in_port_t port); +int udp_sock_init(const struct ctx *c, uint8_t pif, + const union inany_addr *addr, const char *ifname, + in_port_t port); int udp_init(struct ctx *c); void udp_timer(struct ctx *c, const struct timespec *now); void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);

-- Stefano

Stefano Brivio

7:33 a.m.

New subject: [PATCH v3 6/8] util: Fix setting of IPV6_V6ONLY socket option

On Wed, 29 Oct 2025 17:26:26 +1100 David Gibson wrote:

...

Currently we only call setsockopt() on IPV6_V6ONLY when we want to set it to 1, which we typically do on all IPv6 sockets except those explicitly for dual stack listening. That's not quite right in two ways:

* Although IPV6_V6ONLY==0 is normally the default on Linux, that can be changed with the net.ipv6.bindv6only sysctl. It may also have different defaults on other OSes if we ever support them. We know we need it off for dual stack sockets, so explicitly set it to 0 in that case.

* At the same time setting IPV6_V6ONLY to 1 for IPv6 sockets bound to a specific address is harmless but pointless. Don't set the option at all in this case, saving a syscall.

I haven't checked the implications of this but __inet6_bind() handles address "types" IPV6_ADDR_ANY and IPV6_ADDR_MAPPED in the same way, for IPV6_V6ONLY purposes. I'm not sure if this has any influence on functionality though.

...

Signed-off-by: David Gibson --- util.c | 27 +++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/util.c b/util.c index c94efae4..62f43895 100644 --- a/util.c +++ b/util.c @@ -45,14 +45,14 @@ * @type: epoll type * @sa: Socket address to bind to * @ifname: Interface for binding, NULL for any - * @v6only: Set IPV6_V6ONLY socket option + * @v6only: If >= 0, set IPV6_V6ONLY socket option to this value * @data: epoll reference portion for protocol handlers * * Return: newly created socket, negative error code on failure */ static int sock_l4_(const struct ctx *c, enum epoll_type type, const union sockaddr_inany *sa, const char *ifname, - bool v6only, uint32_t data) + int v6only, uint32_t data) { sa_family_t af = sa->sa_family; union epoll_ref ref = { .type = type, .data = data }; @@ -101,9 +101,11 @@ static int sock_l4_(const struct ctx *c, enum epoll_type type,

ref.fd = fd;

- if (v6only) - if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y))) - debug("Failed to set IPV6_V6ONLY on socket %i", fd); + if (v6only >= 0) + if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, + &v6only, sizeof(v6only))) + debug("Failed to set IPV6_V6ONLY to %d on socket %i", + v6only, fd);

Nit: curly brackets (two pairs) for consistency.

...

if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y))) debug("Failed to set SO_REUSEADDR on socket %i", fd); @@ -186,7 +188,16 @@ int sock_l4(const struct ctx *c, enum epoll_type type, const union sockaddr_inany *sa, const char *ifname, uint32_t data) { - return sock_l4_(c, type, sa, ifname, sa->sa_family == AF_INET6, data); + int v6only = -1; + + /* The option doesn't exist for IPv4 sockets, and is irrelevant for IPv6 + * sockets with a non-wildcard address.

Same as above: I don't think this is true, strictly speaking, but I didn't check whether this inaccuracy is in any way relevant.

...

+ */ + if (sa->sa_family == AF_INET6 && + IN6_IS_ADDR_UNSPECIFIED(&sa->sa6.sin6_addr)) + v6only = 1; + + return sock_l4_(c, type, sa, ifname, v6only, data); }

int sock_l4_dualstack(const struct ctx *c, enum epoll_type type, @@ -198,6 +209,10 @@ int sock_l4_dualstack(const struct ctx *c, enum epoll_type type, .sa6.sin6_port = htons(port), };

+ /* Dual stack sockets require IPV6_V6ONLY == 0. Usually that's the + * default, but sysctl net.ipv6.bindv6only can change that, so set the + * sockopt explicitly. + */ return sock_l4_(c, type, &sa, ifname, 0, data); }

The rest of the series looks good to me, including 8/8, but it's not clear to me what the "secondary reasons" to consider 8/8 at this stage might be, so I'm not actually sure what to do with it. Is it because it drops some code? -- Stefano

David Gibson

14 Nov 14 Nov

1:24 a.m.

New subject: [PATCH v3 6/8] util: Fix setting of IPV6_V6ONLY socket option

On Thu, Nov 13, 2025 at 07:33:35AM +0100, Stefano Brivio wrote:

...

On Wed, 29 Oct 2025 17:26:26 +1100 David Gibson wrote:

...
Currently we only call setsockopt() on IPV6_V6ONLY when we want to set it to 1, which we typically do on all IPv6 sockets except those explicitly for dual stack listening. That's not quite right in two ways:

* Although IPV6_V6ONLY==0 is normally the default on Linux, that can be changed with the net.ipv6.bindv6only sysctl. It may also have different defaults on other OSes if we ever support them. We know we need it off for dual stack sockets, so explicitly set it to 0 in that case.

* At the same time setting IPV6_V6ONLY to 1 for IPv6 sockets bound to a specific address is harmless but pointless. Don't set the option at all in this case, saving a syscall.

I haven't checked the implications of this but __inet6_bind() handles address "types" IPV6_ADDR_ANY and IPV6_ADDR_MAPPED in the same way, for IPV6_V6ONLY purposes. I'm not sure if this has any influence on functionality though.

AFAICT, technically yes, but not in a way that really matters. IIUC for a v4-mapped address using IPV6_V6ONLY=0 will let the socket handle IPv4 traffic with the corresponding address. IPV6_V6ONLY will only handle IPv6 traffic that actually has the v4-mapped address on it. We won't actually get to the point of passing v4-mapped addresses to the kernel in passt/pasta, because we already use those to mean "IPv4" and will go to the IPv4 paths.

...

...
Signed-off-by: David Gibson --- util.c | 27 +++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/util.c b/util.c index c94efae4..62f43895 100644 --- a/util.c +++ b/util.c @@ -45,14 +45,14 @@ * @type: epoll type * @sa: Socket address to bind to * @ifname: Interface for binding, NULL for any - * @v6only: Set IPV6_V6ONLY socket option + * @v6only: If >= 0, set IPV6_V6ONLY socket option to this value * @data: epoll reference portion for protocol handlers * * Return: newly created socket, negative error code on failure */ static int sock_l4_(const struct ctx *c, enum epoll_type type, const union sockaddr_inany *sa, const char *ifname, - bool v6only, uint32_t data) + int v6only, uint32_t data) { sa_family_t af = sa->sa_family; union epoll_ref ref = { .type = type, .data = data }; @@ -101,9 +101,11 @@ static int sock_l4_(const struct ctx *c, enum epoll_type type,

ref.fd = fd;

- if (v6only) - if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y))) - debug("Failed to set IPV6_V6ONLY on socket %i", fd); + if (v6only >= 0) + if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, + &v6only, sizeof(v6only))) + debug("Failed to set IPV6_V6ONLY to %d on socket %i", + v6only, fd);

Nit: curly brackets (two pairs) for consistency.

Fixed.

...

...
if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y))) debug("Failed to set SO_REUSEADDR on socket %i", fd); @@ -186,7 +188,16 @@ int sock_l4(const struct ctx *c, enum epoll_type type, const union sockaddr_inany *sa, const char *ifname, uint32_t data) { - return sock_l4_(c, type, sa, ifname, sa->sa_family == AF_INET6, data); + int v6only = -1; + + /* The option doesn't exist for IPv4 sockets, and is irrelevant for IPv6 + * sockets with a non-wildcard address.

Same as above: I don't think this is true, strictly speaking, but I didn't check whether this inaccuracy is in any way relevant.

Right, so technically yes, but I don't think it's relevant for the reasons I gave above. I've rephrased to "we don't care about it for IPv6 sockets with ...". Does that help?

...

...
+ */ + if (sa->sa_family == AF_INET6 && + IN6_IS_ADDR_UNSPECIFIED(&sa->sa6.sin6_addr)) + v6only = 1; + + return sock_l4_(c, type, sa, ifname, v6only, data); }

int sock_l4_dualstack(const struct ctx *c, enum epoll_type type, @@ -198,6 +209,10 @@ int sock_l4_dualstack(const struct ctx *c, enum epoll_type type, .sa6.sin6_port = htons(port), };

+ /* Dual stack sockets require IPV6_V6ONLY == 0. Usually that's the + * default, but sysctl net.ipv6.bindv6only can change that, so set the + * sockopt explicitly. + */ return sock_l4_(c, type, &sa, ifname, 0, data); }

The rest of the series looks good to me, including 8/8, but it's not clear to me what the "secondary reasons" to consider 8/8 at this stage might be, so I'm not actually sure what to do with it.

Is it because it drops some code?

* It drops some code * I think it will address bug 113 (still need to check) * It might also help for some other systemd-resolved edge cases * It will make life a bit easier to implement bug 171 * I think the improved symmetry will make other flexible forwarding changes a bit easier I'll look into bug 113 and revise the commit message for 8/8 in the next spin. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

David Gibson

12:33 a.m.

New subject: [PATCH v3 4/8] udp: Unify some more inbound/outbound parts of udp_sock_init()

On Thu, Nov 13, 2025 at 07:33:26AM +0100, Stefano Brivio wrote:

...

On Wed, 29 Oct 2025 17:26:24 +1100 David Gibson wrote:

...
udp_sock_init() takes an 'ns' parameter determining if it creates a socket in the guest namespace or host namespace. Alter it to take a pif parameter instead, like tcp_sock_init(), and use that change to slightly reduce code duplication between the HOST and SPLICE cases.

Signed-off-by: David Gibson --- conf.c | 2 +- udp.c | 65 +++++++++++++++++++++++++++++++--------------------------- udp.h | 5 +++-- 3 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/conf.c b/conf.c index 26f1bcc0..08cb50aa 100644 --- a/conf.c +++ b/conf.c @@ -171,7 +171,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname, if (optname == 't') ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i); else if (optname == 'u') - ret = udp_sock_init(c, 0, addr, ifname, i); + ret = udp_sock_init(c, PIF_HOST, addr, ifname, i); else /* No way to check in advance for -T and -U */ ret = 0; diff --git a/udp.c b/udp.c index 8f96495d..f3237436 100644 --- a/udp.c +++ b/udp.c @@ -1091,64 +1091,68 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, /** * udp_sock_init() - Initialise listening sockets for a given port * @c: Execution context - * @ns: In pasta mode, if set, bind with loopback address in namespace + * @pif: Interface to open the socket for (PIF_HOST or PIF_SPLICE) * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order * * Return: 0 on (partial) success, negative error code on (complete) failure */ -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr, - const char *ifname, in_port_t port) +int udp_sock_init(const struct ctx *c, uint8_t pif, + const union inany_addr *addr, const char *ifname, + in_port_t port) { union udp_listen_epoll_ref uref = { - .pif = ns ? PIF_SPLICE : PIF_HOST, + .pif = pif, .port = port, }; int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1; + int (*socks)[NUM_PORTS];

ASSERT(!c->no_udp); + ASSERT(pif_is_socket(pif));

- if (!addr && c->ifi4 && c->ifi6 && !ns) { + if (pif == PIF_HOST) + socks = udp_splice_init; + else + socks = udp_splice_ns; + + if (!addr && c->ifi4 && c->ifi6 && pif == PIF_HOST) { int s;

/* Attempt to get a dual stack socket */ s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST, NULL, ifname, port, uref.u32); - udp_splice_init[V4][port] = s < 0 ? -1 : s; - udp_splice_init[V6][port] = s < 0 ? -1 : s; + socks[V4][port] = s < 0 ? -1 : s; + socks[V6][port] = s < 0 ? -1 : s; if (IN_INTERVAL(0, FD_REF_MAX, s)) return 0; }

if ((!addr || inany_v4(addr)) && c->ifi4) { - if (!ns) { - r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST, - addr ? addr : &inany_any4, ifname, - port, uref.u32); + const union inany_addr *a = addr ? + addr : &inany_any4;

Nit: this fits on a single line.

Fixed. Probably a relic from an earlier draft.

...

...
- udp_splice_init[V4][port] = r4 < 0 ? -1 : r4; - } else { - r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE, - &inany_loopback4, ifname, - port, uref.u32); - udp_splice_ns[V4][port] = r4 < 0 ? -1 : r4; - } + if (pif == PIF_SPLICE) + a = &inany_loopback4; + + r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, + port, uref.u32); + + socks[V4][port] = r4 < 0 ? -1 : r4; }

if ((!addr || !inany_v4(addr)) && c->ifi6) { - if (!ns) { - r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST, - addr ? addr : &inany_any6, ifname, - port, uref.u32); + const union inany_addr *a = addr ? + addr : &inany_any6;

...same here.

Fixed.

...

...
- udp_splice_init[V6][port] = r6 < 0 ? -1 : r6; - } else { - r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE, - &inany_loopback6, ifname, - port, uref.u32); - udp_splice_ns[V6][port] = r6 < 0 ? -1 : r6; - } + if (pif == PIF_SPLICE) + a = &inany_loopback6; + + r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname, + port, uref.u32); + + socks[V6][port] = r6 < 0 ? -1 : r6; }

if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6)) @@ -1214,7 +1218,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound)

if ((c->ifi4 && socks[V4][port] == -1) || (c->ifi6 && socks[V6][port] == -1)) - udp_sock_init(c, outbound, NULL, NULL, port); + udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST, + NULL, NULL, port); } }

diff --git a/udp.h b/udp.h index 8f8531ad..f78dc528 100644 --- a/udp.h +++ b/udp.h @@ -17,8 +17,9 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, uint8_t ttl, const struct pool *p, int idx, const struct timespec *now); -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr, - const char *ifname, in_port_t port); +int udp_sock_init(const struct ctx *c, uint8_t pif, + const union inany_addr *addr, const char *ifname, + in_port_t port); int udp_init(struct ctx *c); void udp_timer(struct ctx *c, const struct timespec *now); void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);

-- Stefano

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

18 Nov 18 Nov

1:19 a.m.

New subject: [PATCH v3 6/8] util: Fix setting of IPV6_V6ONLY socket option

On Fri, 14 Nov 2025 11:24:31 +1100 David Gibson wrote:

...

On Thu, Nov 13, 2025 at 07:33:35AM +0100, Stefano Brivio wrote:

...
On Wed, 29 Oct 2025 17:26:26 +1100 David Gibson wrote:

...
Currently we only call setsockopt() on IPV6_V6ONLY when we want to set it to 1, which we typically do on all IPv6 sockets except those explicitly for dual stack listening. That's not quite right in two ways:

* Although IPV6_V6ONLY==0 is normally the default on Linux, that can be changed with the net.ipv6.bindv6only sysctl. It may also have different defaults on other OSes if we ever support them. We know we need it off for dual stack sockets, so explicitly set it to 0 in that case.

* At the same time setting IPV6_V6ONLY to 1 for IPv6 sockets bound to a specific address is harmless but pointless. Don't set the option at all in this case, saving a syscall.

I haven't checked the implications of this but __inet6_bind() handles address "types" IPV6_ADDR_ANY and IPV6_ADDR_MAPPED in the same way, for IPV6_V6ONLY purposes. I'm not sure if this has any influence on functionality though.

AFAICT, technically yes, but not in a way that really matters. IIUC for a v4-mapped address using IPV6_V6ONLY=0 will let the socket handle IPv4 traffic with the corresponding address. IPV6_V6ONLY will only handle IPv6 traffic that actually has the v4-mapped address on it.

We won't actually get to the point of passing v4-mapped addresses to the kernel in passt/pasta, because we already use those to mean "IPv4" and will go to the IPv4 paths.

...
...
Signed-off-by: David Gibson --- util.c | 27 +++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/util.c b/util.c index c94efae4..62f43895 100644 --- a/util.c +++ b/util.c @@ -45,14 +45,14 @@ * @type: epoll type * @sa: Socket address to bind to * @ifname: Interface for binding, NULL for any - * @v6only: Set IPV6_V6ONLY socket option + * @v6only: If >= 0, set IPV6_V6ONLY socket option to this value * @data: epoll reference portion for protocol handlers * * Return: newly created socket, negative error code on failure */ static int sock_l4_(const struct ctx *c, enum epoll_type type, const union sockaddr_inany *sa, const char *ifname, - bool v6only, uint32_t data) + int v6only, uint32_t data) { sa_family_t af = sa->sa_family; union epoll_ref ref = { .type = type, .data = data }; @@ -101,9 +101,11 @@ static int sock_l4_(const struct ctx *c, enum epoll_type type,

ref.fd = fd;

- if (v6only) - if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y))) - debug("Failed to set IPV6_V6ONLY on socket %i", fd); + if (v6only >= 0) + if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, + &v6only, sizeof(v6only))) + debug("Failed to set IPV6_V6ONLY to %d on socket %i", + v6only, fd);

Nit: curly brackets (two pairs) for consistency.

Fixed.

...
...
if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y))) debug("Failed to set SO_REUSEADDR on socket %i", fd); @@ -186,7 +188,16 @@ int sock_l4(const struct ctx *c, enum epoll_type type, const union sockaddr_inany *sa, const char *ifname, uint32_t data) { - return sock_l4_(c, type, sa, ifname, sa->sa_family == AF_INET6, data); + int v6only = -1; + + /* The option doesn't exist for IPv4 sockets, and is irrelevant for IPv6 + * sockets with a non-wildcard address.

Same as above: I don't think this is true, strictly speaking, but I didn't check whether this inaccuracy is in any way relevant.

Right, so technically yes, but I don't think it's relevant for the reasons I gave above. I've rephrased to "we don't care about it for IPv6 sockets with ...". Does that help?

Yes, it does, thanks.

...

...
...
+ */ + if (sa->sa_family == AF_INET6 && + IN6_IS_ADDR_UNSPECIFIED(&sa->sa6.sin6_addr)) + v6only = 1; + + return sock_l4_(c, type, sa, ifname, v6only, data); }

int sock_l4_dualstack(const struct ctx *c, enum epoll_type type, @@ -198,6 +209,10 @@ int sock_l4_dualstack(const struct ctx *c, enum epoll_type type, .sa6.sin6_port = htons(port), };

+ /* Dual stack sockets require IPV6_V6ONLY == 0. Usually that's the + * default, but sysctl net.ipv6.bindv6only can change that, so set the + * sockopt explicitly. + */ return sock_l4_(c, type, &sa, ifname, 0, data); }

The rest of the series looks good to me, including 8/8, but it's not clear to me what the "secondary reasons" to consider 8/8 at this stage might be, so I'm not actually sure what to do with it.

Is it because it drops some code?

* It drops some code * I think it will address bug 113 (still need to check)

Right, I forgot about this part.

...

* It might also help for some other systemd-resolved edge cases * It will make life a bit easier to implement bug 171 * I think the improved symmetry will make other flexible forwarding changes a bit easier

I'll look into bug 113 and revise the commit message for 8/8 in the next spin.

-- Stefano

104

Age (days ago)

124

Last active (days ago)

List overview

Download

11 comments

2 participants

participants (2)

David Gibson
Stefano Brivio

[PATCH v3 0/8] Reduce differences between inbound and outbound socket binding

tags

participants (2)