[PATCH v4 0/9] Reduce differences between inbound and outbound socket binding
This series is based on my series fixing bug 176 (regression in auto forwarding). The fact that outbound forwarding sockets are bound to the loopback address, whereas inbound forwarding sockets are (by default) bound to the unspecified address leads to some unexpected differences between the paths setting up each of them. An idea for tackling bug 100 suggested a different approach which will also reduce some of those differences and allow more code to be shared between the two paths. I've since discovered that this approach doesn't help for bug 100, but I think it's still worthwhile for other reasons. Patches 1..7/9 are cleanups which shouldn't change behaviour, and I think are ready to merge. 8/9 and 9/9 introduce behavioural changes, but I've made the case for them in the patch comments. v4: - Add cleanup patch removing unused structure field - Rebase, fixing conflicts with Laurent's changes - A bunch of spelling and other cosmetic fixes - Clarify relation to bug 100 and bug 113 v3: - A number of additional fixes covering the handling of IPV6_V6ONLY sockopt - Assorted trivial changes v2: - Some rearrangements and rewordings for clarity David Gibson (9): flow: Remove bogus @path field from flowside_sock_args inany: Let length of sockaddr_inany be implicit from the family util, flow, pif: Simplify sock_l4_sa() interface tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() udp: Unify some more inbound/outbound parts of udp_sock_init() udp: Move udp_sock_init() special case to its caller util: Fix setting of IPV6_V6ONLY socket option tcp, udp: Remove fallback if creating dual stack socket fails tcp, udp: Bind outbound listening sockets by interface instead of address conf.c | 4 +- flow.c | 22 +++---- icmp.c | 3 +- inany.h | 17 ++++++ pif.c | 27 ++------- pif.h | 2 +- tcp.c | 164 +++++++++++++++++++-------------------------------- tcp.h | 5 +- tcp_splice.c | 5 +- udp.c | 102 +++++++++++++++----------------- udp.h | 5 +- util.c | 76 ++++++++++++++++++++---- util.h | 8 ++- 13 files changed, 221 insertions(+), 219 deletions(-) -- 2.51.1
Surprisingly little logic is shared between the path for creating a
listen()ing socket in the guest namespace versus in the host namespace.
Improve this, by extending tcp_sock_init_one() to take a pif parameter
indicating where it should open the socket. This allows
tcp_ns_sock_init[46]() to be removed entirely.
We generalise tcp_sock_init() in the same way, although we don't fully use
it yet, due to some subtle differences in how we bind for -t versus -T.
Signed-off-by: David Gibson
This was there when the structure was created, but never used. Looks like
a copy-pasta error.
Fixes: 781164e25bdf ("flow: Helper to create sockets based on flowside")
Signed-off-by: David Gibson
sock_l4_sa() has a somewhat confusing 'v6only' option controlling whether
to set the IPV6_V6ONLY socket option. Usually it's set when the given
address is IPv6, but not when we want to create a dual stack listening
socket. The latter only makes sense when the address is :: however.
Clarify this by only keeping the v6only option in an internal helper
sock_l4_(). External users will call either sock_l4() which always creates
a socket bound to a specific IP version, or sock_l4_dualstack() which
creates a dual stack socket, but takes only a port not an address.
We drop the '_sa' suffix while we're at it - it exists because this used
to be an internal version with a sock_l4() wrapper. The wrapper no longer
exists so the '_sa' is no longer useful.
Signed-off-by: David Gibson
Inbound sockets are bound to the unspecified address by default, whereas
outbound sockets are bound to the loopback address. This is currently
handled in udp_sock_init() which ignores its addr argument in the outbound
case.
Move the handling of this special case to the caller, udp_port_rebind().
This makes the semantics of the 'addr' argument more consistent, and will
make future changes easier.
Signed-off-by: David Gibson
udp_sock_init() takes an 'ns' parameter determining if it creates a socket
in the guest namespace or host namespace. Alter it to take a pif
parameter instead, like tcp_sock_init(), and use that change to slightly
reduce code duplication between the HOST and SPLICE cases.
Signed-off-by: David Gibson
To save kernel memory we try to use "dual stack" sockets which can listen
for both IPv4 and IPv6 connections where possible. To support kernels
which don't allow dual stack sockets, we fall back to creating individual
sockets for IPv4 and IPv6.
This fallback causes some mild ugliness now, and will cause more difficulty
with upcoming improvements to the forwarding logic. I don't think we need
the fallback on the following grounds:
1) The fallback was broken since inception:
The fallback was triggered if pif_sock_l4() failed attempting to create the
dual stack socket. But even if the kernel didn't support them,
pif_sock_l4() would not report a failure.
- Dual stack sockets are distinguished by having the IPV6_V6ONLY sockopt
set to 0. However, until the last patch, we only called setsockopt()
if we wanted to set this to 1, so there was no kernel operation which
could fail for dual stack sockets - we'd silently create a IPv6 only
socket instead.
- Even if we did call the setsockopt(), we only printed a debug() message
for failures, we didn't report it to the caller
2) Dual stack sockets are not just a Linux extension
The dual stack socket interface is described in RFC3493, specifically
section 3.7 and section 5.3. It is supported on BSD:
https://man.freebsd.org/cgi/man.cgi?query=ip6
and on Windows:
https://learn.microsoft.com/en-us/windows/win32/winsock/ipproto-ipv6-socket-...
3) Linux has supported dual stack sockets for over 20 years
According to ipv6(7) the IPV6_V6ONLY socket option was introduced in
Linux 2.6 and Linux 2.4.21 both from 2003.
Signed-off-by: David Gibson
Currently, outbound forwards (-T, -U) are handled by sockets bound to the
loopback address. Typically we create two sockets, one for 127.0.0.1 and
one for ::1.
This has some disadvantages:
* The guest can't connect via 127.0.0.0/8 addresses other than 127.0.0.1
* We can't use dual-stack sockets, we have to have separate sockets for
IPv4 and IPv6.
The restriction exists for a reason though. If the guest has any
interfaces other than pasta (e.g. a VPN tunnel) external hosts could reach
the host via the forwards. Especially combined with -T auto / -U auto this
would make it very easy to make a mistake with nasty security implications.
We can achieve this a different way, however. Don't bind to a specific
address, but _do_ use SO_BINDTODEVICE to restrict the sockets to the "lo"
interface.
Note that although traffic to a local but non-loopback address is passed
over the 'lo' interface (as seen by netfilter and dumpcap), it doesn't
count as attached to that interface for the purposes of SO_BINDTODEVICE
(information from the routing table overrides the "physical" interface).
So, this change doesn't help for bug 100.
It's also not a complete fix for bug 113, it does however:
* Get us a step closer to fixing bug 113
* Slightly simplify the code
* Make things a bit easier to allow more flexible binding on the guest in
in future
Link: https://bugs.passt.top/show_bug.cgi?id=100
Link: https://bugs.passt.top/show_bug.cgi?id=113
Signed-off-by: David Gibson
Currently we only call setsockopt() on IPV6_V6ONLY when we want to set it
to 1, which we typically do on all IPv6 sockets except those explicitly for
dual stack listening. That's not quite right in two ways:
* Although IPV6_V6ONLY==0 is normally the default on Linux, that can be
changed with the net.ipv6.bindv6only sysctl. It may also have different
defaults on other OSes if we ever support them. We know we need it off
for dual stack sockets, so explicitly set it to 0 in that case.
* At the same time setting IPV6_V6ONLY to 1 for IPv6 sockets bound to a
specific address is harmless but pointless. Don't set the option at all
in this case, saving a syscall.
Signed-off-by: David Gibson
sockaddr_inany can contain either an IPv4 or IPv6 socket address, so the
relevant length for bind() or connect() can vary. In pif_sockaddr() we
return that length, and in sock_l4_sa() we take it as an extra parameter.
However, sockaddr_inany always contains exactly a sockaddr_in or a
sockaddr_in6 each with a fixed size. Therefore we can derive the relevant
length from the family, and don't need to pass it around separately.
Make a tiny helper to get the relevant address length, and update all
interfaces to use that approach instead.
In the process, fix a buglet in tcp_flow_repair_bind(): we passed
sizeof(union sockaddr_inany) to bind() instead of the specific length for
the address family. Since the sizeof() is always longer than the specific
length, this is probably fine, but not theoretically correct.
Signed-off-by: David Gibson
The series looks good to me in general, except that:
On Wed, 19 Nov 2025 16:22:57 +1100
David Gibson
Currently, outbound forwards (-T, -U) are handled by sockets bound to the loopback address. Typically we create two sockets, one for 127.0.0.1 and one for ::1.
This has some disadvantages: * The guest can't connect via 127.0.0.0/8 addresses other than 127.0.0.1 * We can't use dual-stack sockets, we have to have separate sockets for IPv4 and IPv6.
The restriction exists for a reason though. If the guest has any interfaces other than pasta (e.g. a VPN tunnel) external hosts could reach the host via the forwards. Especially combined with -T auto / -U auto this would make it very easy to make a mistake with nasty security implications.
We can achieve this a different way, however. Don't bind to a specific address, but _do_ use SO_BINDTODEVICE to restrict the sockets to the "lo" interface.
...this means, as I pointed out on: https://archives.passt.top/passt-dev/20251022105916.53925523@elisabeth/ that we might break functionality for a number of pasta(1) users. I don't have a complete version of the SO_BINDTODEVICE fallback I sketched there, so I can't just add one on top of this series at the moment, but we need something like that before I can merge this. -- Stefano
On Fri, Nov 21, 2025 at 04:56:01AM +0100, Stefano Brivio wrote:
The series looks good to me in general, except that:
On Wed, 19 Nov 2025 16:22:57 +1100 David Gibson
wrote: Currently, outbound forwards (-T, -U) are handled by sockets bound to the loopback address. Typically we create two sockets, one for 127.0.0.1 and one for ::1.
This has some disadvantages: * The guest can't connect via 127.0.0.0/8 addresses other than 127.0.0.1 * We can't use dual-stack sockets, we have to have separate sockets for IPv4 and IPv6.
The restriction exists for a reason though. If the guest has any interfaces other than pasta (e.g. a VPN tunnel) external hosts could reach the host via the forwards. Especially combined with -T auto / -U auto this would make it very easy to make a mistake with nasty security implications.
We can achieve this a different way, however. Don't bind to a specific address, but _do_ use SO_BINDTODEVICE to restrict the sockets to the "lo" interface.
...this means, as I pointed out on:
https://archives.passt.top/passt-dev/20251022105916.53925523@elisabeth/
that we might break functionality for a number of pasta(1) users.
I don't have a complete version of the SO_BINDTODEVICE fallback I sketched there, so I can't just add one on top of this series at the moment, but we need something like that before I can merge this.
Yes, I was intending to make that change, but just forgot. I'll fix this patch and resend. Are you intending to merge the rest of the series, or should I respin it? -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, 21 Nov 2025 16:24:20 +1100
David Gibson
On Fri, Nov 21, 2025 at 04:56:01AM +0100, Stefano Brivio wrote:
The series looks good to me in general, except that:
On Wed, 19 Nov 2025 16:22:57 +1100 David Gibson
wrote: Currently, outbound forwards (-T, -U) are handled by sockets bound to the loopback address. Typically we create two sockets, one for 127.0.0.1 and one for ::1.
This has some disadvantages: * The guest can't connect via 127.0.0.0/8 addresses other than 127.0.0.1 * We can't use dual-stack sockets, we have to have separate sockets for IPv4 and IPv6.
The restriction exists for a reason though. If the guest has any interfaces other than pasta (e.g. a VPN tunnel) external hosts could reach the host via the forwards. Especially combined with -T auto / -U auto this would make it very easy to make a mistake with nasty security implications.
We can achieve this a different way, however. Don't bind to a specific address, but _do_ use SO_BINDTODEVICE to restrict the sockets to the "lo" interface.
...this means, as I pointed out on:
https://archives.passt.top/passt-dev/20251022105916.53925523@elisabeth/
that we might break functionality for a number of pasta(1) users.
I don't have a complete version of the SO_BINDTODEVICE fallback I sketched there, so I can't just add one on top of this series at the moment, but we need something like that before I can merge this.
Yes, I was intending to make that change, but just forgot. I'll fix this patch and resend. Are you intending to merge the rest of the series, or should I respin it?
I guess it's marginally easier if you respin so that we have a single link to the series, eventually, and I would run tests on the series as a whole, not just up to 8/9. -- Stefano
On Fri, Nov 21, 2025 at 04:56:01AM +0100, Stefano Brivio wrote:
The series looks good to me in general, except that:
On Wed, 19 Nov 2025 16:22:57 +1100 David Gibson
wrote: Currently, outbound forwards (-T, -U) are handled by sockets bound to the loopback address. Typically we create two sockets, one for 127.0.0.1 and one for ::1.
This has some disadvantages: * The guest can't connect via 127.0.0.0/8 addresses other than 127.0.0.1 * We can't use dual-stack sockets, we have to have separate sockets for IPv4 and IPv6.
The restriction exists for a reason though. If the guest has any interfaces other than pasta (e.g. a VPN tunnel) external hosts could reach the host via the forwards. Especially combined with -T auto / -U auto this would make it very easy to make a mistake with nasty security implications.
We can achieve this a different way, however. Don't bind to a specific address, but _do_ use SO_BINDTODEVICE to restrict the sockets to the "lo" interface.
...this means, as I pointed out on:
https://archives.passt.top/passt-dev/20251022105916.53925523@elisabeth/
that we might break functionality for a number of pasta(1) users.
I don't have a complete version of the SO_BINDTODEVICE fallback I sketched there, so I can't just add one on top of this series at the moment, but we need something like that before I can merge this.
I re-examined your proposed approach, but realised it doesn't quite work. The problem is that to complete it, sock_l4_sa() would need to create both an IPv4 and IPv6. That works right now, but it breaks the assumption that tcp_sock_init() and udp_sock_init() create (at most) a single socket. That wasn't the case until 8/9 in this series, but part of the reason for 8/9 is because establishing that invariant makes a bunch of stuff in the works much saner. So, I'm working to figure out a different approach for an SO_BINDTODEVICE fallback. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
participants (2)
-
David Gibson
-
Stefano Brivio