[PATCH v3 00/14] Introduce forwarding table
This creates a new data structure for recording our port forwarding: a table of individual forwarding rules. This is intended to replace the existing forwarding mode, map and delta structures. For now, only the delta[] array is removed. map is still used, but only for automatic forwards. Mode is still used, but only has significance during configuration. There's still a lot that can be done for flexible forwarding, but this introduces the core data structure, and does enough to fix at least one concrete defect with the current logic (bug 187). I think it would be a good point to give it a solid review and maybe merge. I'm aware that this is a very large complex series, that will be difficult to review. I think that's more or less inevitable implementing a feature as broad and complex as "flexible forwarding". If it helps at all, I'm fine if it's not reviewed all at once - I won't assume review is complete until either you say so, or I get comments or R-b on each patch. This applies on top of all 3 of my outstanding series in this area: "Allow listen functions to return fds", "Clean ups to epoll references" (v2) and "Small cleanups to splice forwarding logic". David Gibson (14): inany: Extend inany_ntop() to treat NULL as a fully unspecified address conf, fwd: Keep a table of our port forwarding configuration conf: Accurately record ifname and address for outbound forwards conf, fwd: Record "auto" port forwards in forwarding table fwd: Make space to store listening sockets in forward table ip: Add ipproto_name() function fwd, tcp, udp: Set up listening sockets based on forward table tcp, udp: Remove old auto-forwarding socket arrays conf, fwd: Check forwarding table for conflicting rules fwd: Generate auto-forward exclusions from socket fd tables flow, fwd: Consult rules table when forwarding a new flow from socket fwd: Remap ports based directly on forwarding rule fwd, tcp, udp: Add forwarding rule to listening socket epoll references flow, fwd: Optimise forwarding rule lookup using epoll ref when possible conf.c | 118 +++++++++------ flow.c | 59 ++++++-- flow_table.h | 2 +- fwd.c | 379 ++++++++++++++++++++++++++++++++++++++++++++++--- fwd.h | 68 ++++++++- icmp.c | 2 +- inany.c | 28 +++- inany.h | 1 + ip.c | 26 ++++ ip.h | 2 + tcp.c | 154 ++------------------ tcp.h | 6 +- udp.c | 147 +++---------------- udp.h | 7 +- udp_flow.c | 14 +- udp_flow.h | 2 +- udp_internal.h | 4 +- 17 files changed, 641 insertions(+), 378 deletions(-) -- 2.52.0
At present, we set up forwarding as we parse the -t, and -u options, not
keeping a persistent data structure with all the details. We do have
some information in struct fwd_ports, but it doesn't capture all the nuance
that the options do.
As a first step to generalising our forwarding model, add a table of all
the forwarding rules to struct fwd_ports. For now it covers only explicit
forwards, not automatic, and we don't do anything with it other than print
some additional debug information. We'll do more with it in future
patches.
Signed-off-by: David Gibson
-T and -U options don't allow specifying a listening address. Usually this
will listen on *%lo in the guest. However on kernels without unprivileged
SO_BINDTODEVICE that's not possible so we instead listen separately on
127.0.0.1 and ::1.
Currently that's handled at the point we actually set up the listens,
we record both address and ifname as NULL in the forwarding table
entry. That will cause trouble for future extensions we want, so
update this to accurately create the forwarding table: either a single
rule with ifname == "lo" or two rules with addresses of 127.0.0.1 and
::1.
As a bonus, this gives the user a warning if they specify an explicit
outbound forwarding on a kernel without SO_BINDTODEVICE. The existing
warning for missing SO_BINDTODEVICE incorrectly covered only the case
of -T auto or -U auto.
Signed-off-by: David Gibson
Currently the forwarding table records details of explicit port forwards,
but nothing for -[tTuU] auto. That looks a little odd on the debug output,
and will be a problem for future changes.
Extend the forward table to have rules for auto-scanned forwards,
using a new FWD_SCAN flag. For now the mechanism of auto port
forwarding isn't updated, and we will only create a single FWD_SCAN
rule per protocol and direction. We'll better integrate auto scanning
with other forward table mechanics in future.
Signed-off-by: David Gibson
Add a function to get the name of an IP protocol from its number. Usually
this would be done by getprotobynumber(), but that requires access to
/etc/protocols and might allocate. We can't do either of those once we've
self-isolated.
Signed-off-by: David Gibson
Now that we've moved listening socket management to the new forwarding
table data structure, the existing arrays of socket fds are maintained,
but never consulted. Remove them.
Signed-off-by: David Gibson
Previously we created inbound listening sockets as we parsed the forwarding
options (-t, -u) whereas outbound listening sockets were created during
{tcp,udp}_init(). Now that we have a data structure recording the full
details of the listening options we can move all listening socket creation
to {tcp,udp}_init(). This means that errors for either direction are
detected and reported the same way.
Introduce fwd_listen_sync() which synchronizes the state of listening
sockets to the forwarding rules table, both for fixed and automatic
forwards.
This does cause a change in semantics for "exclude only" port
specifications. Previously an option like -t ~6000 wouldn't cause a
fatal error, as long as we could bind at least one port. Now, it
requires at least one port for each generated rule; that is for each
of the contiguous blocks of ports the specification resolves to. With
typical ephemeral ports settings that's one port each in 1..5999,
6001..32767 and 61000..65535.
Preserving the exact behaviour for this case would require a considerably
more complex data structure, so I'm hoping this is a sufficiently niche
case for the change to be acceptable.
Signed-off-by: David Gibson
When auto-forwarding based on port scans, we must exclude our own
listening ports, to avoid circular forwards. Currently we use the
(previous value of the) forwarding bitmaps for the reverse direction
to determine that.
Instead, generate it from the tables of listening sockets that we now
maintain. For now this seems like a lot more work to get to the same
place. However, it does mean we're basing our exclusions directly on the
relevant information: which of the scanned listens belong to us. More
importantly, it's a step towards removing the bitmaps entirely.
Signed-off-by: David Gibson
At present, we don't keep track of the fds for listening sockets (except
for "auto" ones). Since the fd is stored in the epoll reference, we didn't
need an alternative source of it for the various handlers.
However, we're intending to allow dynamic changes to forwarding
configuration in future. That means we need a way to enumerate sockets so
we can close them on removal of a forward.
Extend our forwarding table data structure to make space for all the
listening sockets. To avoid allocation, this imposes another limit:
we could run out of space for socket fds before we run out of slots
for forwarding rules.
We don't actually do anything with the allocate spaced yet. For
"auto" forwards it's redundant with existing arrays. We'll fix both
of those in later patches.
Signed-off-by: David Gibson
It's possible for a user to supply conflicting forwarding parameters, e.g.
$ pasta -t 80:8080 -t 127.0.0.1/80:8888
We give a warning in this case, but it's based on the legacy
forwarding bitmaps. This is too strict, because it will also warn on
cases that shouldn't conflict because they use different addresses,
e.g.
$ pasta -t 192.0.2.1/80:8080 127.0.0.1/80:8888
Theoretically, it's also too loose because it won't take into account
auto-scan forwarding rules. We can't hit that in practice now,
because we only ever have one auto-scan rule and nothing else, but we
want to remove that restriction in future.
Replace the bitmap based check with a check based on actually scanning
the forwarding rules for conflicts.
Signed-off-by: David Gibson
We now have a formal array of forwarding rules. However, we don't actually
consult it when we forward a new flow. Instead we rely on (a) implicit
information (we wouldn't be here if there wasn't a listening socket for the
rule) and (b) the legacy delta[] data structure.
Start addressing this, by searching for a matching forwarding rule when
attempting to forward a new flow. For now this is incomplete:
* We only do this for socket-initiated flows
* We make a potentially costly linear lookup
* We don't actually use the matching rule for anything yet
We'll address each of those in later patches.
Signed-off-by: David Gibson
Currently we remap port numbers based on the legacy delta[] array, which
is indexed only by original port number, not the listening address. Now
that we look up a forwarding rule entry in flow_target(), we can use this
entry to directly determine the correct remapped port. Implement this,
and remove the old delta[] array.
Link: https://bugs.passt.top/show_bug.cgi?id=187
Signed-off-by: David Gibson
Now that we have a table of all our forwarding rules, every listening
socket can be associated with a specific rule. Add an index allowing us to
locate that rule from the socket's epoll reference. We don't use it yet,
but we'll use it to optimise rule lookup when forwarding new flows.
Signed-off-by: David Gibson
Now that listening sockets include a reference to the forwarding rule which
created them we can, in many cases, avoid a linear search of the forwarding
table when we want to find the relevant rule. Instead we can take the
rule index from the socket's epoll reference, and use that to immediately
find the correct rule.
This is conceptually simple, but requires a moderate amount of plumbing to
get the index from the reference through to the rule lookup. We still
allow fall back to linear search if we don't have the index, and this may
(rarely) be used in the udp_flush_flow() case, where we could get packets
for one flow on a different flow's socket, rather than through a listening
socket as usual.
Signed-off-by: David Gibson
On 1/8/26 03:29, David Gibson wrote:
Add a function to get the name of an IP protocol from its number. Usually this would be done by getprotobynumber(), but that requires access to /etc/protocols and might allocate. We can't do either of those once we've self-isolated.
Signed-off-by: David Gibson
--- ip.c | 27 +++++++++++++++++++++++++++ ip.h | 2 ++ 2 files changed, 29 insertions(+) diff --git a/ip.c b/ip.c index 9a7f4c54..f1d224bd 100644 --- a/ip.c +++ b/ip.c @@ -67,3 +67,30 @@ found: *proto = nh; return true; } + +/** + * ipproto_name() - Get IP protocol name from number + * @proto: IP protocol number + * + * Return: pointer to name of protocol @proto + * + * Usually this would be done with getprotobynumber(3) but that reads + * /etc/protocols and might allocate, which isn't possible for us once + * self-isolated. + */ +/* cppcheck-suppress unusedFunction */ +const char *ipproto_name(uint8_t proto) +{ + switch (proto) { + case IPPROTO_ICMP: + return "ICMP"; + case IPPROTO_TCP: + return "TCP"; + case IPPROTO_UDP: + return "UDP"; + case IPPROTO_ICMPV6: + return "ICMPv6"; + default: + return "<unknown protocol>"; + } +} diff --git a/ip.h b/ip.h index 5830b923..42b417c5 100644 --- a/ip.h +++ b/ip.h @@ -116,6 +116,7 @@ static inline uint32_t ip6_get_flow_lbl(const struct ipv6hdr *ip6h) }
bool ipv6_l4hdr(struct iov_tail *data, uint8_t *proto, size_t *dlen); +const char *ipproto_name(uint8_t proto);
/* IPv6 link-local all-nodes multicast address, ff02::1 */ static const struct in6_addr in6addr_ll_all_nodes = { @@ -135,4 +136,5 @@ static const struct in_addr in4addr_broadcast = { 0xffffffff }; #define IPV6_MIN_MTU 1280 #endif
+
Extra blank line
#endif /* IP_H */
Reviewed-by: Laurent Vivier
On Thu, Jan 08, 2026 at 02:22:21PM +0100, Laurent Vivier wrote:
On 1/8/26 03:29, David Gibson wrote:
Add a function to get the name of an IP protocol from its number. Usually this would be done by getprotobynumber(), but that requires access to /etc/protocols and might allocate. We can't do either of those once we've self-isolated.
Signed-off-by: David Gibson
--- ip.c | 27 +++++++++++++++++++++++++++ ip.h | 2 ++ 2 files changed, 29 insertions(+) diff --git a/ip.c b/ip.c index 9a7f4c54..f1d224bd 100644 --- a/ip.c +++ b/ip.c @@ -67,3 +67,30 @@ found: *proto = nh; return true; } + +/** + * ipproto_name() - Get IP protocol name from number + * @proto: IP protocol number + * + * Return: pointer to name of protocol @proto + * + * Usually this would be done with getprotobynumber(3) but that reads + * /etc/protocols and might allocate, which isn't possible for us once + * self-isolated. + */ +/* cppcheck-suppress unusedFunction */ +const char *ipproto_name(uint8_t proto) +{ + switch (proto) { + case IPPROTO_ICMP: + return "ICMP"; + case IPPROTO_TCP: + return "TCP"; + case IPPROTO_UDP: + return "UDP"; + case IPPROTO_ICMPV6: + return "ICMPv6"; + default: + return "<unknown protocol>"; + } +} diff --git a/ip.h b/ip.h index 5830b923..42b417c5 100644 --- a/ip.h +++ b/ip.h @@ -116,6 +116,7 @@ static inline uint32_t ip6_get_flow_lbl(const struct ipv6hdr *ip6h) } bool ipv6_l4hdr(struct iov_tail *data, uint8_t *proto, size_t *dlen); +const char *ipproto_name(uint8_t proto); /* IPv6 link-local all-nodes multicast address, ff02::1 */ static const struct in6_addr in6addr_ll_all_nodes = { @@ -135,4 +136,5 @@ static const struct in_addr in4addr_broadcast = { 0xffffffff }; #define IPV6_MIN_MTU 1280 #endif +
Extra blank line
Oops, fixed.
#endif /* IP_H */
Reviewed-by: Laurent Vivier
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
participants (2)
-
David Gibson
-
Laurent Vivier