[PATCH 00/11] Preliminaries for UDP flow support
The redesign of UDP flows required (or at least, suggested) a new batch of prelininary changes that don't rely on the core of the flow table rework. David Gibson (11): util: sock_l4() determine protocol from epoll type rather than the reverse flow: Add flow_sidx_valid() helper udp: Pass full epoll reference through more of sock handler path udp: Rename IOV and mmsghdr arrays udp: Unify udp[46]_mh_splice udp: Unify udp[46]_l2_iov udp: Don't repeatedly initialise udp[46]_eth_hdr udp: Move some more of sock_handler tasks into sub-functions udp: Consolidate datagram batching contrib: Add program to document and test assumptions about SO_REUSEADDR contrib: Test behaviour of zero length datagram recv()s contrib/udp-behaviour/.gitignore | 2 + contrib/udp-behaviour/Makefile | 45 +++ contrib/udp-behaviour/common.c | 66 ++++ contrib/udp-behaviour/common.h | 47 +++ contrib/udp-behaviour/recv-zero.c | 74 +++++ contrib/udp-behaviour/reuseaddr-priority.c | 240 ++++++++++++++ epoll_type.h | 41 +++ flow.h | 11 + flow_table.h | 2 +- icmp.c | 2 +- passt.h | 32 -- tcp.c | 17 +- udp.c | 361 ++++++++++----------- util.c | 48 +-- util.h | 3 +- 15 files changed, 735 insertions(+), 256 deletions(-) create mode 100644 contrib/udp-behaviour/.gitignore create mode 100644 contrib/udp-behaviour/Makefile create mode 100644 contrib/udp-behaviour/common.c create mode 100644 contrib/udp-behaviour/common.h create mode 100644 contrib/udp-behaviour/recv-zero.c create mode 100644 contrib/udp-behaviour/reuseaddr-priority.c create mode 100644 epoll_type.h -- 2.45.2
sock_l4() creates a socket of the given IP protocol number, and adds it to
the epoll state. Currently it determines the correct tag for the epoll
data based on the protocol. However, we have some future cases where we
might want different semantics, and therefore epoll types, for sockets of
the same protocol. So, change sock_l4() to take the epoll type as an
explicit parameter, and determine the protocol from that.
Signed-off-by: David Gibson
On Thu, 4 Jul 2024 14:58:25 +1000
David Gibson
sock_l4() creates a socket of the given IP protocol number, and adds it to the epoll state. Currently it determines the correct tag for the epoll data based on the protocol. However, we have some future cases where we might want different semantics, and therefore epoll types, for sockets of the same protocol. So, change sock_l4() to take the epoll type as an explicit parameter, and determine the protocol from that.
The interface is a bit surprising, but I guess it makes later changes much more convenient, so be it. Just one nit:
Signed-off-by: David Gibson
--- epoll_type.h | 41 +++++++++++++++++++++++++++++++++++++++++ icmp.c | 2 +- passt.h | 32 -------------------------------- tcp.c | 10 +++++----- udp.c | 12 ++++++------ util.c | 48 ++++++++++++++++++++++++++---------------------- util.h | 3 ++- 7 files changed, 81 insertions(+), 67 deletions(-) create mode 100644 epoll_type.h diff --git a/epoll_type.h b/epoll_type.h new file mode 100644 index 00000000..42e876e5 --- /dev/null +++ b/epoll_type.h @@ -0,0 +1,41 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright Red Hat + * Author: Davd Gibson
I'm fairly sure it's spellt David. :) -- Stefano
On Thu, Jul 04, 2024 at 11:19:25PM +0200, Stefano Brivio wrote:
On Thu, 4 Jul 2024 14:58:25 +1000 David Gibson
wrote: sock_l4() creates a socket of the given IP protocol number, and adds it to the epoll state. Currently it determines the correct tag for the epoll data based on the protocol. However, we have some future cases where we might want different semantics, and therefore epoll types, for sockets of the same protocol. So, change sock_l4() to take the epoll type as an explicit parameter, and determine the protocol from that.
The interface is a bit surprising, but I guess it makes later changes much more convenient, so be it.
Yes. For the new UDP flow design, I have both "listening" and "reply" sockets which are basically the same at the kernel level, but need different epoll information.
Just one nit:
Signed-off-by: David Gibson
--- epoll_type.h | 41 +++++++++++++++++++++++++++++++++++++++++ icmp.c | 2 +- passt.h | 32 -------------------------------- tcp.c | 10 +++++----- udp.c | 12 ++++++------ util.c | 48 ++++++++++++++++++++++++++---------------------- util.h | 3 ++- 7 files changed, 81 insertions(+), 67 deletions(-) create mode 100644 epoll_type.h diff --git a/epoll_type.h b/epoll_type.h new file mode 100644 index 00000000..42e876e5 --- /dev/null +++ b/epoll_type.h @@ -0,0 +1,41 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright Red Hat + * Author: Davd Gibson
I'm fairly sure it's spellt David. :)
Oops, fixed :). -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
To implement the TCP hash table, we need an invalid (NULL-like) value for
flow_sidx_t. We use FLOW_SIDX_NONE for that, but for defensiveness, we
treat (usually) anything with an out of bounds flow index the same way.
That's not always done consistently though. In flow_at_sidx() we open code
a check on the flow index. In tcp_hash_probe() we instead compare against
FLOW_SIDX_NONE, and in some other places we use the fact that
flow_at_sidx() will return NULL in this case, even if we don't otherwise
need the flow it returns.
Clean this up a bit, by adding an explicit flow_sidx_valid() test function.
Signed-off-by: David Gibson
udp_buf_sock_handler() takes the epoll reference from the receiving socket,
and passes the UDP relevant part on to several other functions. Future
changes are going to need several different epoll types for UDP, and to
pass that information through to some of those functions. To avoid extra
noise in the patches making the real changes, change those functions now
to take the full epoll reference, rather than just the UDP part.
Signed-off-by: David Gibson
On Thu, 4 Jul 2024 14:58:27 +1000
David Gibson
udp_buf_sock_handler() takes the epoll reference from the receiving socket, and passes the UDP relevant part on to several other functions. Future changes are going to need several different epoll types for UDP, and to pass that information through to some of those functions. To avoid extra noise in the patches making the real changes, change those functions now to take the full epoll reference, rather than just the UDP part.
Signed-off-by: David Gibson
--- udp.c | 63 +++++++++++++++++++++++++++++++---------------------------- 1 file changed, 33 insertions(+), 30 deletions(-) diff --git a/udp.c b/udp.c index eadf4872..6301bda2 100644 --- a/udp.c +++ b/udp.c @@ -477,25 +477,26 @@ static int udp_splice_new_ns(void *arg)
/** * udp_mmh_splice_port() - Is source address of message suitable for splicing? - * @uref: UDP epoll reference for incoming message's origin socket + * @ref: Epoll reference for incoming message's origin socket
Nit: in 1/11 (and later in this patch), this is "epoll" instead, which looks more correct to me as it's a proper noun, but not capitalised. Same for udp_update_hdr6().
* @mmh: mmsghdr of incoming message * * Return: if source address of message in @mmh refers to localhost (127.0.0.1 * or ::1) its source port (host order), otherwise -1. */ -static int udp_mmh_splice_port(union udp_epoll_ref uref, - const struct mmsghdr *mmh) +static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) { const struct sockaddr_in6 *sa6 = mmh->msg_hdr.msg_name; const struct sockaddr_in *sa4 = mmh->msg_hdr.msg_name;
- if (!uref.splice) + ASSERT(ref.type == EPOLL_TYPE_UDP); + + if (!ref.udp.splice) return -1;
- if (uref.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr)) + if (ref.udp.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr)) return ntohs(sa6->sin6_port);
- if (!uref.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr)) + if (!ref.udp.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr)) return ntohs(sa4->sin_port);
return -1; @@ -507,7 +508,7 @@ static int udp_mmh_splice_port(union udp_epoll_ref uref, * @start: Index of first datagram in udp[46]_l2_buf * @n: Total number of datagrams in udp[46]_l2_buf pool * @dst: Datagrams will be sent to this port (on destination side) - * @uref: UDP epoll reference for origin socket + * @ref: epoll reference for origin socket * @now: Timestamp * * This consumes as many datagrams as are sendable via a single socket. It @@ -518,7 +519,7 @@ static int udp_mmh_splice_port(union udp_epoll_ref uref, * Return: Number of datagrams forwarded */ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, - in_port_t dst, union udp_epoll_ref uref, + in_port_t dst, union epoll_ref ref, const struct timespec *now) { in_port_t src = udp_meta[start].splicesrc; @@ -527,8 +528,9 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, int s;
ASSERT(udp_meta[start].splicesrc >= 0); + ASSERT(ref.type == EPOLL_TYPE_UDP);
- if (uref.v6) { + if (ref.udp.v6) { mmh_recv = udp6_l2_mh_sock; mmh_send = udp6_mh_splice; udp6_localname.sin6_port = htons(dst); @@ -544,27 +546,27 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, if (++i >= n) break;
- udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]); + udp_meta[i].splicesrc = udp_mmh_splice_port(ref, &mmh_recv[i]); } while (udp_meta[i].splicesrc == src);
- if (uref.pif == PIF_SPLICE) { + if (ref.udp.pif == PIF_SPLICE) { src += c->udp.fwd_in.rdelta[src]; - s = udp_splice_init[uref.v6][src].sock; - if (s < 0 && uref.orig) - s = udp_splice_new(c, uref.v6, src, false); + s = udp_splice_init[ref.udp.v6][src].sock; + if (s < 0 && ref.udp.orig) + s = udp_splice_new(c, ref.udp.v6, src, false);
if (s < 0) goto out;
- udp_splice_ns[uref.v6][dst].ts = now->tv_sec; - udp_splice_init[uref.v6][src].ts = now->tv_sec; + udp_splice_ns[ref.udp.v6][dst].ts = now->tv_sec; + udp_splice_init[ref.udp.v6][src].ts = now->tv_sec; } else { - ASSERT(uref.pif == PIF_HOST); + ASSERT(ref.udp.pif == PIF_HOST); src += c->udp.fwd_out.rdelta[src]; - s = udp_splice_ns[uref.v6][src].sock; - if (s < 0 && uref.orig) { + s = udp_splice_ns[ref.udp.v6][src].sock; + if (s < 0 && ref.udp.orig) { struct udp_splice_new_ns_arg arg = { - c, uref.v6, src, -1, + c, ref.udp.v6, src, -1, };
NS_CALL(udp_splice_new_ns, &arg); @@ -573,8 +575,8 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, if (s < 0) goto out;
- udp_splice_init[uref.v6][dst].ts = now->tv_sec; - udp_splice_ns[uref.v6][src].ts = now->tv_sec; + udp_splice_init[ref.udp.v6][dst].ts = now->tv_sec; + udp_splice_ns[ref.udp.v6][src].ts = now->tv_sec; }
sendmmsg(s, mmh_send + start, i - start, MSG_NOSIGNAL); @@ -716,7 +718,7 @@ static size_t udp_update_hdr6(const struct ctx *c, * @start: Index of first datagram in udp[46]_l2_buf pool * @n: Total number of datagrams in udp[46]_l2_buf pool * @dstport: Destination port number on destination side - * @uref: UDP epoll reference for origin socket + * @ref: Epoll reference for origin socket * @now: Current timestamp * * This consumes as many frames as are sendable via tap. It requires that @@ -726,7 +728,7 @@ static size_t udp_update_hdr6(const struct ctx *c, * Return: Number of frames sent via tap */ static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n, - in_port_t dstport, union udp_epoll_ref uref, + in_port_t dstport, union epoll_ref ref, const struct timespec *now) { struct iovec (*tap_iov)[UDP_NUM_IOVS]; @@ -734,8 +736,9 @@ static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n, size_t i = start;
ASSERT(udp_meta[start].splicesrc == -1); + ASSERT(ref.type == EPOLL_TYPE_UDP);
- if (uref.v6) { + if (ref.udp.v6) { tap_iov = udp6_l2_iov_tap; mmh_recv = udp6_l2_mh_sock; } else { @@ -748,7 +751,7 @@ static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n, struct udp_meta_t *bm = &udp_meta[i]; size_t l4len;
- if (uref.v6) { + if (ref.udp.v6) { l4len = udp_update_hdr6(c, &bm->ip6h, &bm->s_in.sa6, bp, dstport, udp6_l2_mh_sock[i].msg_len, now); @@ -766,7 +769,7 @@ static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n, if (++i >= n) break;
- udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]); + udp_meta[i].splicesrc = udp_mmh_splice_port(ref, &mmh_recv[i]); } while (udp_meta[i].splicesrc == -1);
tap_send_frames(c, &tap_iov[start][0], UDP_NUM_IOVS, i - start); @@ -823,12 +826,12 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve * present). So we fill in entry 0 before the loop, then udp_*_send() * populate one entry past where they consume. */ - udp_meta[0].splicesrc = udp_mmh_splice_port(ref.udp, mmh_recv); + udp_meta[0].splicesrc = udp_mmh_splice_port(ref, mmh_recv); for (i = 0; i < n; i += m) { if (udp_meta[i].splicesrc >= 0) - m = udp_splice_send(c, i, n, dstport, ref.udp, now); + m = udp_splice_send(c, i, n, dstport, ref, now); else - m = udp_tap_send(c, i, n, dstport, ref.udp, now); + m = udp_tap_send(c, i, n, dstport, ref, now); } }
-- Stefano
On Thu, Jul 04, 2024 at 11:20:07PM +0200, Stefano Brivio wrote:
On Thu, 4 Jul 2024 14:58:27 +1000 David Gibson
wrote: udp_buf_sock_handler() takes the epoll reference from the receiving socket, and passes the UDP relevant part on to several other functions. Future changes are going to need several different epoll types for UDP, and to pass that information through to some of those functions. To avoid extra noise in the patches making the real changes, change those functions now to take the full epoll reference, rather than just the UDP part.
Signed-off-by: David Gibson
--- udp.c | 63 +++++++++++++++++++++++++++++++---------------------------- 1 file changed, 33 insertions(+), 30 deletions(-) diff --git a/udp.c b/udp.c index eadf4872..6301bda2 100644 --- a/udp.c +++ b/udp.c @@ -477,25 +477,26 @@ static int udp_splice_new_ns(void *arg)
/** * udp_mmh_splice_port() - Is source address of message suitable for splicing? - * @uref: UDP epoll reference for incoming message's origin socket + * @ref: Epoll reference for incoming message's origin socket
Nit: in 1/11 (and later in this patch), this is "epoll" instead, which looks more correct to me as it's a proper noun, but not capitalised. Same for udp_update_hdr6().
Good point, fixed. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Make the salient points about these various arrays clearer with renames:
* udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do
with L2. They are, however, specific to receiving not sending not
receiving. Rename to udp_iov_recv and udp[46]_mh_recv.
* udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa.
Rename to udp[46]_l2_iov
* udp[46]_localname are (for now) pre-populated with the locan address
but the more salient point is that these are the destination address for
the splice arrays. Rename to udp[46]_spliceto
Signed-off-by: David Gibson
On Thu, 4 Jul 2024 14:58:28 +1000
David Gibson
Make the salient points about these various arrays clearer with renames:
* udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do with L2.
The original idea behind that 'l2' there was to have the type of destination in the name first, and then the source ('sock'). On the other hand, they're actually clearer this way.
They are, however, specific to receiving not sending not receiving.
I failed to decrypt this one. :)
Rename to udp_iov_recv and udp[46]_mh_recv.
* udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa. Rename to udp[46]_l2_iov
* udp[46]_localname are (for now) pre-populated with the locan address but the more salient point is that these are the destination address for the splice arrays. Rename to udp[46]_spliceto
Very slight preference (but not worth a lot of changes, in case): udp[46]_splice_to. To me it's not immediately obvious those are two words otherwise. -- Stefano
On Thu, Jul 04, 2024 at 11:20:53PM +0200, Stefano Brivio wrote:
On Thu, 4 Jul 2024 14:58:28 +1000 David Gibson
wrote: Make the salient points about these various arrays clearer with renames:
* udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do with L2.
The original idea behind that 'l2' there was to have the type of destination in the name first, and then the source ('sock').
Ah! That makes sense once you know.
On the other hand, they're actually clearer this way.
Right, I think this is more obvious out the gate.
They are, however, specific to receiving not sending not receiving.
I failed to decrypt this one. :)
Oops, corrected.
Rename to udp_iov_recv and udp[46]_mh_recv.
* udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa. Rename to udp[46]_l2_iov
* udp[46]_localname are (for now) pre-populated with the locan address but the more salient point is that these are the destination address for the splice arrays. Rename to udp[46]_spliceto
Very slight preference (but not worth a lot of changes, in case): udp[46]_splice_to. To me it's not immediately obvious those are two words otherwise.
Easy fix, and on consideration I prefer "splice_to" as well. Changed. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
We have separate mmsghdr arrays for splicing IPv4 and IPv6 packets, where
the only difference is that they point to different sockaddr buffers for
the destination address.
Unify these by having the common array point at a sockaddr_inany as the
address. This does mean slightly more work when we're about to splice,
because we need to write the whole socket address, rather than just the
port. However it removes 32 mmsghdr structures and we're going to need
more flexibility constructing that target address for the flow table.
Because future changes might mean that the address isn't always loopback,
change the name of the common address from *_localname to udp_splicename.
Signed-off-by: David Gibson
The only differences between these arrays are that udp4_l2_iov is
pre-initialised to point to the IPv4 ethernet header, and IPv4 per-frame
header and udp6_l2_iov points to the IPv6 versions.
We already have to set up a bunch of headers per-frame, including updating
udp[46]_l2_iov[i][UDP_IOV_PAYLOAD].iov_len. It makes more sense to adjust
the IOV entries to point at the correct headers for the frame than to have
two complete sets of iovecs.
Signed-off-by: David Gibson
Since we split our packet frame buffers into different pieces, we have
a single buffer per IP version for the ethernet header, rather than one
per frame. This makes sense since our ethernet header is alwaus the same.
However we initialise those buffers udp[46]_eth_hdr inside a per frame
loop. Pull that outside the loop so we just initialise them once.
Signed-off-by: David Gibson
udp_buf_sock_handler(), udp_splice_send() and udp_tap_send loosely, do four
things between them:
1. Receive some datagrams from a socket
2. Split those datagrams into batches depending on how they need to be
sent (via tap or via a specific splice socket)
3. Prepare buffers for each datagram to send it onwards
4. Actually send it onwards
Split (1) and (3) into specific helper functions. This isn't
immediately useful (udp_splice_prepare(), in particular, is trivial),
but it will make further reworks clearer.
Signed-off-by: David Gibson
When we receive datagrams on a socket, we need to split them into batches
depending on how they need to be forwarded (either via a specific splice
socket, or via tap). The logic to do this, is somewhat awkwardly split
between udp_buf_sock_handler() itself, udp_splice_send() and
udp_tap_send().
Move all the batching logic into udp_buf_sock_handler(), leaving
udp_splice_send() to just send the prepared batch. udp_tap_send() reduces
to just a call to tap_send_frames() so open-code that call in
udp_buf_sock_handler().
This will allow separating the batching logic from the rest of the datagram
forwarding logic, which we'll need for upcoming flow table support.
Signed-off-by: David Gibson
On Thu, 4 Jul 2024 14:58:33 +1000
David Gibson
When we receive datagrams on a socket, we need to split them into batches depending on how they need to be forwarded (either via a specific splice socket, or via tap). The logic to do this, is somewhat awkwardly split between udp_buf_sock_handler() itself, udp_splice_send() and udp_tap_send().
Move all the batching logic into udp_buf_sock_handler(), leaving udp_splice_send() to just send the prepared batch. udp_tap_send() reduces to just a call to tap_send_frames() so open-code that call in udp_buf_sock_handler().
This will allow separating the batching logic from the rest of the datagram forwarding logic, which we'll need for upcoming flow table support.
Signed-off-by: David Gibson
--- udp.c | 128 ++++++++++++++++++---------------------------------------- 1 file changed, 39 insertions(+), 89 deletions(-) diff --git a/udp.c b/udp.c index 8ed59639..a317e986 100644 --- a/udp.c +++ b/udp.c @@ -501,42 +501,29 @@ static void udp_splice_prepare(struct mmsghdr *mmh, unsigned idx) }
/** - * udp_splice_send() - Send datagrams from socket to socket + * udp_splice_send() - Send a batch of datagrams from socket to socket * @c: Execution context - * @start: Index of first datagram in udp[46]_l2_buf - * @n: Total number of datagrams in udp[46]_l2_buf pool - * @dst: Datagrams will be sent to this port (on destination side) + * @start: Index of batch's first datagram in udp[46]_l2_buf + * @n: Number of datagrams in batch + * @src: Source port for datagram (target side) + * @dst: Destination port for datagrams (target side) * @ref: epoll reference for origin socket * @now: Timestamp - * - * This consumes as many datagrams as are sendable via a single socket. It - * requires that udp_meta[@start].splicesrc is initialised, and will initialise - * udp_meta[].splicesrc for each datagram it consumes *and one more* (if - * present). - * - * Return: Number of datagrams forwarded */ -static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, - in_port_t dst, union epoll_ref ref, - const struct timespec *now) +static void udp_splice_send(const struct ctx *c, size_t start, size_t n, + in_port_t src, in_port_t dst, + union epoll_ref ref, + const struct timespec *now) { - in_port_t src = udp_meta[start].splicesrc; - struct mmsghdr *mmh_recv; - unsigned int i = start; int s;
- ASSERT(udp_meta[start].splicesrc >= 0); - ASSERT(ref.type == EPOLL_TYPE_UDP); - if (ref.udp.v6) { - mmh_recv = udp6_mh_recv; udp_spliceto.sa6 = (struct sockaddr_in6) { .sin6_family = AF_INET6, .sin6_addr = in6addr_loopback, .sin6_port = htons(dst), }; } else { - mmh_recv = udp4_mh_recv; udp_spliceto.sa4 = (struct sockaddr_in) { .sin_family = AF_INET, .sin_addr = in4addr_loopback, @@ -544,15 +531,6 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, }; }
- do { - udp_splice_prepare(mmh_recv, i); - - if (++i >= n) - break; - - udp_meta[i].splicesrc = udp_mmh_splice_port(ref, &mmh_recv[i]); - } while (udp_meta[i].splicesrc == src); - if (ref.udp.pif == PIF_SPLICE) { src += c->udp.fwd_in.rdelta[src]; s = udp_splice_init[ref.udp.v6][src].sock; @@ -560,7 +538,7 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, s = udp_splice_new(c, ref.udp.v6, src, false);
if (s < 0) - goto out; + return;
udp_splice_ns[ref.udp.v6][dst].ts = now->tv_sec; udp_splice_init[ref.udp.v6][src].ts = now->tv_sec; @@ -577,15 +555,13 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, s = arg.s; } if (s < 0) - goto out; + return;
udp_splice_init[ref.udp.v6][dst].ts = now->tv_sec; udp_splice_ns[ref.udp.v6][src].ts = now->tv_sec; }
- sendmmsg(s, udp_mh_splice + start, i - start, MSG_NOSIGNAL); -out: - return i - start; + sendmmsg(s, udp_mh_splice + start, n, MSG_NOSIGNAL); }
/** @@ -725,7 +701,7 @@ static size_t udp_update_hdr6(const struct ctx *c, * @v6: Prepare for IPv6? * @now: Current timestamp */ -static void udp_tap_prepare(const struct ctx *c, struct mmsghdr *mmh, +static void udp_tap_prepare(const struct ctx *c, const struct mmsghdr *mmh, unsigned idx, in_port_t dstport, bool v6, const struct timespec *now) { @@ -752,49 +728,6 @@ static void udp_tap_prepare(const struct ctx *c, struct mmsghdr *mmh, (*tap_iov)[UDP_IOV_PAYLOAD].iov_len = l4len; }
-/** - * udp_tap_send() - Prepare UDP datagrams and send to tap interface - * @c: Execution context - * @start: Index of first datagram in udp[46]_l2_buf pool - * @n: Total number of datagrams in udp[46]_l2_buf pool - * @dstport: Destination port number on destination side - * @ref: Epoll reference for origin socket - * @now: Current timestamp - * - * This consumes as many frames as are sendable via tap. It requires that - * udp_meta[@start].splicesrc is initialised, and will initialise - * udp_meta[].splicesrc for each frame it consumes *and one more* (if present). - * - * Return: Number of frames sent via tap - */ -static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n, - in_port_t dstport, union epoll_ref ref, - const struct timespec *now) -{ - struct mmsghdr *mmh_recv; - size_t i = start; - - ASSERT(udp_meta[start].splicesrc == -1); - ASSERT(ref.type == EPOLL_TYPE_UDP); - - if (ref.udp.v6) - mmh_recv = udp6_mh_recv; - else - mmh_recv = udp4_mh_recv; - - do { - udp_tap_prepare(c, mmh_recv, i, dstport, ref.udp.v6, now); - - if (++i >= n) - break; - - udp_meta[i].splicesrc = udp_mmh_splice_port(ref, &mmh_recv[i]); - } while (udp_meta[i].splicesrc == -1); - - tap_send_frames(c, &udp_l2_iov[start][0], UDP_NUM_IOVS, i - start); - return i - start; -} - /** * udp_sock_recv() - Receive datagrams from a socket * @c: Execution context @@ -842,7 +775,7 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve { struct mmsghdr *mmh_recv = ref.udp.v6 ? udp6_mh_recv : udp4_mh_recv; in_port_t dstport = ref.udp.port; - int n, m, i; + int n, i;
if ((n = udp_sock_recv(c, ref.fd, events, mmh_recv)) <= 0) return; @@ -852,19 +785,36 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve else if (ref.udp.pif == PIF_HOST) dstport += c->udp.fwd_in.f.delta[dstport];
- /* We divide things into batches based on how we need to send them, + /* We divide datagrams into batches based on how we need to send them, * determined by udp_meta[i].splicesrc. To avoid either two passes * through the array, or recalculating splicesrc for a single entry, we - * have to populate it one entry *ahead* of the loop counter (if - * present). So we fill in entry 0 before the loop, then udp_*_send() - * populate one entry past where they consume. + * have to populate it one entry *ahead* of the loop counter. */ udp_meta[0].splicesrc = udp_mmh_splice_port(ref, mmh_recv); - for (i = 0; i < n; i += m) { - if (udp_meta[i].splicesrc >= 0) - m = udp_splice_send(c, i, n, dstport, ref, now); + for (i = 0; i < n; ) { + int batchsrc = udp_meta[i].splicesrc; + int batchstart = i; + + do { + if (batchsrc >= 0) + udp_splice_prepare(mmh_recv, i); + else + udp_tap_prepare(c, mmh_recv, i, dstport, + ref.udp.v6, now); + + if (++i >= n) + break; + + udp_meta[i].splicesrc = udp_mmh_splice_port(ref, + &mmh_recv[i]); + } while (udp_meta[i].splicesrc == batchsrc); + + if (batchsrc >= 0) + udp_splice_send(c, batchstart, i - batchstart, + batchsrc, dstport, ref, now); else - m = udp_tap_send(c, i, n, dstport, ref, now); + tap_send_frames(c, &udp_l2_iov[batchstart][0], + UDP_NUM_IOVS, i - batchstart); }
The logic looks correct to me, but the nested loop makes it a bit hard to grasp. I'm wondering if we shouldn't rather have a single loop, always preparing the datagrams, noting down the previous udp_meta[i].splicesrc and the first index of a batch, and starting a new batch (sending the previous one) once the current udp_meta[i].splicesrc doesn't match the previous value. I tried to sketch this quickly but failed for the moment to come up with anything vaguely elegant, so I'm fine with either version. Nits: curly brackets around multiple lines. -- Stefano
On Fri, Jul 05, 2024 at 11:10:45AM +0200, Stefano Brivio wrote:
On Thu, 4 Jul 2024 14:58:33 +1000 David Gibson
wrote: When we receive datagrams on a socket, we need to split them into batches depending on how they need to be forwarded (either via a specific splice socket, or via tap). The logic to do this, is somewhat awkwardly split between udp_buf_sock_handler() itself, udp_splice_send() and udp_tap_send().
Move all the batching logic into udp_buf_sock_handler(), leaving udp_splice_send() to just send the prepared batch. udp_tap_send() reduces to just a call to tap_send_frames() so open-code that call in udp_buf_sock_handler().
This will allow separating the batching logic from the rest of the datagram forwarding logic, which we'll need for upcoming flow table support.
Signed-off-by: David Gibson
--- udp.c | 128 ++++++++++++++++++---------------------------------------- 1 file changed, 39 insertions(+), 89 deletions(-) diff --git a/udp.c b/udp.c index 8ed59639..a317e986 100644 --- a/udp.c +++ b/udp.c @@ -501,42 +501,29 @@ static void udp_splice_prepare(struct mmsghdr *mmh, unsigned idx) }
/** - * udp_splice_send() - Send datagrams from socket to socket + * udp_splice_send() - Send a batch of datagrams from socket to socket * @c: Execution context - * @start: Index of first datagram in udp[46]_l2_buf - * @n: Total number of datagrams in udp[46]_l2_buf pool - * @dst: Datagrams will be sent to this port (on destination side) + * @start: Index of batch's first datagram in udp[46]_l2_buf + * @n: Number of datagrams in batch + * @src: Source port for datagram (target side) + * @dst: Destination port for datagrams (target side) * @ref: epoll reference for origin socket * @now: Timestamp - * - * This consumes as many datagrams as are sendable via a single socket. It - * requires that udp_meta[@start].splicesrc is initialised, and will initialise - * udp_meta[].splicesrc for each datagram it consumes *and one more* (if - * present). - * - * Return: Number of datagrams forwarded */ -static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, - in_port_t dst, union epoll_ref ref, - const struct timespec *now) +static void udp_splice_send(const struct ctx *c, size_t start, size_t n, + in_port_t src, in_port_t dst, + union epoll_ref ref, + const struct timespec *now) { - in_port_t src = udp_meta[start].splicesrc; - struct mmsghdr *mmh_recv; - unsigned int i = start; int s;
- ASSERT(udp_meta[start].splicesrc >= 0); - ASSERT(ref.type == EPOLL_TYPE_UDP); - if (ref.udp.v6) { - mmh_recv = udp6_mh_recv; udp_spliceto.sa6 = (struct sockaddr_in6) { .sin6_family = AF_INET6, .sin6_addr = in6addr_loopback, .sin6_port = htons(dst), }; } else { - mmh_recv = udp4_mh_recv; udp_spliceto.sa4 = (struct sockaddr_in) { .sin_family = AF_INET, .sin_addr = in4addr_loopback, @@ -544,15 +531,6 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, }; }
- do { - udp_splice_prepare(mmh_recv, i); - - if (++i >= n) - break; - - udp_meta[i].splicesrc = udp_mmh_splice_port(ref, &mmh_recv[i]); - } while (udp_meta[i].splicesrc == src); - if (ref.udp.pif == PIF_SPLICE) { src += c->udp.fwd_in.rdelta[src]; s = udp_splice_init[ref.udp.v6][src].sock; @@ -560,7 +538,7 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, s = udp_splice_new(c, ref.udp.v6, src, false);
if (s < 0) - goto out; + return;
udp_splice_ns[ref.udp.v6][dst].ts = now->tv_sec; udp_splice_init[ref.udp.v6][src].ts = now->tv_sec; @@ -577,15 +555,13 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, s = arg.s; } if (s < 0) - goto out; + return;
udp_splice_init[ref.udp.v6][dst].ts = now->tv_sec; udp_splice_ns[ref.udp.v6][src].ts = now->tv_sec; }
- sendmmsg(s, udp_mh_splice + start, i - start, MSG_NOSIGNAL); -out: - return i - start; + sendmmsg(s, udp_mh_splice + start, n, MSG_NOSIGNAL); }
/** @@ -725,7 +701,7 @@ static size_t udp_update_hdr6(const struct ctx *c, * @v6: Prepare for IPv6? * @now: Current timestamp */ -static void udp_tap_prepare(const struct ctx *c, struct mmsghdr *mmh, +static void udp_tap_prepare(const struct ctx *c, const struct mmsghdr *mmh, unsigned idx, in_port_t dstport, bool v6, const struct timespec *now) { @@ -752,49 +728,6 @@ static void udp_tap_prepare(const struct ctx *c, struct mmsghdr *mmh, (*tap_iov)[UDP_IOV_PAYLOAD].iov_len = l4len; }
-/** - * udp_tap_send() - Prepare UDP datagrams and send to tap interface - * @c: Execution context - * @start: Index of first datagram in udp[46]_l2_buf pool - * @n: Total number of datagrams in udp[46]_l2_buf pool - * @dstport: Destination port number on destination side - * @ref: Epoll reference for origin socket - * @now: Current timestamp - * - * This consumes as many frames as are sendable via tap. It requires that - * udp_meta[@start].splicesrc is initialised, and will initialise - * udp_meta[].splicesrc for each frame it consumes *and one more* (if present). - * - * Return: Number of frames sent via tap - */ -static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n, - in_port_t dstport, union epoll_ref ref, - const struct timespec *now) -{ - struct mmsghdr *mmh_recv; - size_t i = start; - - ASSERT(udp_meta[start].splicesrc == -1); - ASSERT(ref.type == EPOLL_TYPE_UDP); - - if (ref.udp.v6) - mmh_recv = udp6_mh_recv; - else - mmh_recv = udp4_mh_recv; - - do { - udp_tap_prepare(c, mmh_recv, i, dstport, ref.udp.v6, now); - - if (++i >= n) - break; - - udp_meta[i].splicesrc = udp_mmh_splice_port(ref, &mmh_recv[i]); - } while (udp_meta[i].splicesrc == -1); - - tap_send_frames(c, &udp_l2_iov[start][0], UDP_NUM_IOVS, i - start); - return i - start; -} - /** * udp_sock_recv() - Receive datagrams from a socket * @c: Execution context @@ -842,7 +775,7 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve { struct mmsghdr *mmh_recv = ref.udp.v6 ? udp6_mh_recv : udp4_mh_recv; in_port_t dstport = ref.udp.port; - int n, m, i; + int n, i;
if ((n = udp_sock_recv(c, ref.fd, events, mmh_recv)) <= 0) return; @@ -852,19 +785,36 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve else if (ref.udp.pif == PIF_HOST) dstport += c->udp.fwd_in.f.delta[dstport];
- /* We divide things into batches based on how we need to send them, + /* We divide datagrams into batches based on how we need to send them, * determined by udp_meta[i].splicesrc. To avoid either two passes * through the array, or recalculating splicesrc for a single entry, we - * have to populate it one entry *ahead* of the loop counter (if - * present). So we fill in entry 0 before the loop, then udp_*_send() - * populate one entry past where they consume. + * have to populate it one entry *ahead* of the loop counter. */ udp_meta[0].splicesrc = udp_mmh_splice_port(ref, mmh_recv); - for (i = 0; i < n; i += m) { - if (udp_meta[i].splicesrc >= 0) - m = udp_splice_send(c, i, n, dstport, ref, now); + for (i = 0; i < n; ) { + int batchsrc = udp_meta[i].splicesrc; + int batchstart = i; + + do { + if (batchsrc >= 0) + udp_splice_prepare(mmh_recv, i); + else + udp_tap_prepare(c, mmh_recv, i, dstport, + ref.udp.v6, now); + + if (++i >= n) + break; + + udp_meta[i].splicesrc = udp_mmh_splice_port(ref, + &mmh_recv[i]); + } while (udp_meta[i].splicesrc == batchsrc); + + if (batchsrc >= 0) + udp_splice_send(c, batchstart, i - batchstart, + batchsrc, dstport, ref, now); else - m = udp_tap_send(c, i, n, dstport, ref, now); + tap_send_frames(c, &udp_l2_iov[batchstart][0], + UDP_NUM_IOVS, i - batchstart); }
The logic looks correct to me, but the nested loop makes it a bit hard to grasp.
I don't disagree it's pretty hard to follow, but I haven't really seen a better way.
I'm wondering if we shouldn't rather have a single loop, always preparing the datagrams, noting down the previous udp_meta[i].splicesrc and the first index of a batch, and starting a new batch (sending the previous one) once the current udp_meta[i].splicesrc doesn't match the previous value.
I can't really picture what you have in mind here.
I tried to sketch this quickly but failed for the moment to come up with anything vaguely elegant, so I'm fine with either version.
Nits: curly brackets around multiple lines.
That, at least, I can fix. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
For the approach we intend to use for handling UDP flows, we have some
pretty specific requirements about how SO_REUSEADDR works with UDP sockets.
Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s,
and therefore there can be multiple sockets which are eligible to receive
the same datagram. Which one will actually receive it is important to us.
Add a test program which verifies things work the way we expect, which
documents what those expectations are in the process.
Signed-off-by: David Gibson
On Thu, 4 Jul 2024 14:58:34 +1000
David Gibson
For the approach we intend to use for handling UDP flows, we have some pretty specific requirements about how SO_REUSEADDR works with UDP sockets. Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s, and therefore there can be multiple sockets which are eligible to receive the same datagram. Which one will actually receive it is important to us.
Add a test program which verifies things work the way we expect, which documents what those expectations are in the process.
Signed-off-by: David Gibson
--- contrib/udp-behaviour/.gitignore | 1 + contrib/udp-behaviour/Makefile | 45 ++++ contrib/udp-behaviour/common.c | 66 ++++++ contrib/udp-behaviour/common.h | 47 ++++ contrib/udp-behaviour/reuseaddr-priority.c | 240 +++++++++++++++++++++ 5 files changed, 399 insertions(+)
I reviewed these (10/11 and 11/11) a bit lightly, but they look sane to me. I just wonder: wouldn't it be better to have contrib/linux/udp-behaviour instead, so that it's consistent with the other stuff unter contrib/ (project names, kind of)? By the way, I reviewed everything else except for 9/11. That will take me a bit longer. And for the rest, I just have nits that I could also take care of on merge. -- Stefano
On Thu, Jul 04, 2024 at 11:21:21PM +0200, Stefano Brivio wrote:
On Thu, 4 Jul 2024 14:58:34 +1000 David Gibson
wrote: For the approach we intend to use for handling UDP flows, we have some pretty specific requirements about how SO_REUSEADDR works with UDP sockets. Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s, and therefore there can be multiple sockets which are eligible to receive the same datagram. Which one will actually receive it is important to us.
Add a test program which verifies things work the way we expect, which documents what those expectations are in the process.
Signed-off-by: David Gibson
--- contrib/udp-behaviour/.gitignore | 1 + contrib/udp-behaviour/Makefile | 45 ++++ contrib/udp-behaviour/common.c | 66 ++++++ contrib/udp-behaviour/common.h | 47 ++++ contrib/udp-behaviour/reuseaddr-priority.c | 240 +++++++++++++++++++++ 5 files changed, 399 insertions(+) I reviewed these (10/11 and 11/11) a bit lightly, but they look sane to me. I just wonder: wouldn't it be better to have contrib/linux/udp-behaviour instead, so that it's consistent with the other stuff unter contrib/ (project names, kind of)?
Well.. if we ever port to something non-Linux, we'll need the same socket behaviour there. Indeed, that's one reason I think having these test programs is valuable. So I don't think 'linux/' is a great pick. In some ways contrib/ isn't really the right place for this. Maybe it would be better under doc/? But at the moment that's more user facing than developer facing documentation.
By the way, I reviewed everything else except for 9/11. That will take me a bit longer. And for the rest, I just have nits that I could also take care of on merge.
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, 5 Jul 2024 10:06:12 +1000
David Gibson
On Thu, Jul 04, 2024 at 11:21:21PM +0200, Stefano Brivio wrote:
On Thu, 4 Jul 2024 14:58:34 +1000 David Gibson
wrote: For the approach we intend to use for handling UDP flows, we have some pretty specific requirements about how SO_REUSEADDR works with UDP sockets. Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s, and therefore there can be multiple sockets which are eligible to receive the same datagram. Which one will actually receive it is important to us.
Add a test program which verifies things work the way we expect, which documents what those expectations are in the process.
Signed-off-by: David Gibson
--- contrib/udp-behaviour/.gitignore | 1 + contrib/udp-behaviour/Makefile | 45 ++++ contrib/udp-behaviour/common.c | 66 ++++++ contrib/udp-behaviour/common.h | 47 ++++ contrib/udp-behaviour/reuseaddr-priority.c | 240 +++++++++++++++++++++ 5 files changed, 399 insertions(+) I reviewed these (10/11 and 11/11) a bit lightly, but they look sane to me. I just wonder: wouldn't it be better to have contrib/linux/udp-behaviour instead, so that it's consistent with the other stuff unter contrib/ (project names, kind of)?
Well.. if we ever port to something non-Linux, we'll need the same socket behaviour there. Indeed, that's one reason I think having these test programs is valuable. So I don't think 'linux/' is a great pick.
Oh, oops, I thought SO_REUSEADDR were specific to Linux, that's why I was suggesting linux/, but it's actually supported by all the BSDs.
In some ways contrib/ isn't really the right place for this. Maybe it would be better under doc/? But at the moment that's more user facing than developer facing documentation.
I would still say it's documentation and it can happily fit under doc/. Distribution packages don't copy the whole doc/ (to /usr/share/doc/) anyway. Or test/kernel/? But it's not something we want to check regularly, it's really an example to help with development. All in all, I don't have a strong preference, doc/ looks like a better fit to me, but contrib/ isn't problematic either. -- Stefano
On Fri, Jul 05, 2024 at 10:33:43AM +0200, Stefano Brivio wrote:
On Fri, 5 Jul 2024 10:06:12 +1000 David Gibson
wrote: On Thu, Jul 04, 2024 at 11:21:21PM +0200, Stefano Brivio wrote:
On Thu, 4 Jul 2024 14:58:34 +1000 David Gibson
wrote: For the approach we intend to use for handling UDP flows, we have some pretty specific requirements about how SO_REUSEADDR works with UDP sockets. Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s, and therefore there can be multiple sockets which are eligible to receive the same datagram. Which one will actually receive it is important to us.
Add a test program which verifies things work the way we expect, which documents what those expectations are in the process.
Signed-off-by: David Gibson
--- contrib/udp-behaviour/.gitignore | 1 + contrib/udp-behaviour/Makefile | 45 ++++ contrib/udp-behaviour/common.c | 66 ++++++ contrib/udp-behaviour/common.h | 47 ++++ contrib/udp-behaviour/reuseaddr-priority.c | 240 +++++++++++++++++++++ 5 files changed, 399 insertions(+) I reviewed these (10/11 and 11/11) a bit lightly, but they look sane to me. I just wonder: wouldn't it be better to have contrib/linux/udp-behaviour instead, so that it's consistent with the other stuff unter contrib/ (project names, kind of)?
Well.. if we ever port to something non-Linux, we'll need the same socket behaviour there. Indeed, that's one reason I think having these test programs is valuable. So I don't think 'linux/' is a great pick.
Oh, oops, I thought SO_REUSEADDR were specific to Linux, that's why I was suggesting linux/, but it's actually supported by all the BSDs.
RIght. I believe SO_REUSEPORT is Linux specific, but the weaker SO_REUSEADDR is much older, and is all I need for the things I have planned.
In some ways contrib/ isn't really the right place for this. Maybe it would be better under doc/? But at the moment that's more user facing than developer facing documentation.
I would still say it's documentation and it can happily fit under doc/. Distribution packages don't copy the whole doc/ (to /usr/share/doc/) anyway.
Ok, I'm going with doc/platform-requirements/ in the next spin.
Or test/kernel/? But it's not something we want to check regularly, it's really an example to help with development.
All in all, I don't have a strong preference, doc/ looks like a better fit to me, but contrib/ isn't problematic either.
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Add a test program verifying that we're able to discard datagrams from a
socket without needing a big discard buffer, by using a zero length recv().
Signed-off-by: David Gibson
participants (2)
-
David Gibson
-
Stefano Brivio