[RFC PATCH 0/5] Pad all inbound frames to 802.3 minimum size if needed
Patch 1/5 handles the easy non-batched case with a copy to a padded
buffer (only if needed). Patches 2/5 and 3/5 clean coding style up
before further changes. Patch 4/5 deals with TCP and UDP batched
frames in non-vhost-user modes.

Patch 5/5 is for batched frames in vhost-user mode instead, and this
is the part I'm the least confident about, in particular with regard
to the size assumptions on vhost-user provided buffers.

Stefano Brivio (5):
  tap: Pad non-batched frames to 802.3 minimum (60 bytes) if needed
  tcp: Fix coding style for comment to enum tcp_iov_parts
  udp: Fix coding style for comment to enum udp_iov_idx
  tcp, udp: Pad batched frames to 60 bytes (802.3 minimum) in
    non-vhost-user modes
  tcp, udp: Pad batched frames for vhost-user modes to 60 bytes (802.3
    minimum)

 tap.c          |  7 +++++++
 tcp.c          |  2 --
 tcp_buf.c      | 23 +++++++++++++++++++++++
 tcp_internal.h |  5 ++++-
 tcp_vu.c       | 27 +++++++++++++++++++++++++++
 udp.c          | 31 ++++++++++++++++++++++++++-----
 udp_vu.c       | 11 ++++++++++-
 util.c         |  3 +++
 util.h         |  3 +++
 9 files changed, 103 insertions(+), 9 deletions(-)

-- 
2.43.0
IEEE 802.3 requires a minimum frame payload of 46 bytes, without a
802.1Q tag. Add padding for the simple tap_send_single() case using
a zero-filled 60-byte buffer and copying data to it if needed.
In theory, we could add a further element in the iovec array, say:
  uint8_t padding[ETH_ZLEN] = { 0 };
  struct iovec iov[3];

  ...

  if (l2len < ETH_ZLEN) {
          iov[iovcnt].iov_base = (void *)padding;
          iov[iovcnt].iov_len = ETH_ZLEN - l2len;
          iovcnt++;
  }
and avoid a copy, but that would substantially complicate the
vhost-user path, and it's questionable whether passing a reference
to a further buffer actually causes lower overhead than the simple
copy.
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
...as I'm going to add a new value to it.
Fixes: 95601237ef82 ("tcp: Replace TCP buffer structure by an iovec array")
Signed-off-by: Stefano Brivio
...as I'm going to add a new value to it.
Fixes: 3f9bd867b585 ("udp: Split tap-bound UDP packets into multiple buffers using io vector")
Signed-off-by: Stefano Brivio
Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and
UDP_IOV_ETH_PAD for UDP, after the payload, make that point to a
zero-filled buffer, and send out a part of it if needed to reach
the minimum frame length given by 802.3, that is, 60 bytes altogether.
The frames we might need to pad are IPv4 only (the IPv6 header is
larger), and are typically TCP ACK segments but can also be small
data segments or datagrams.
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
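[The mechanism described above boils down to a small helper that sets the length of the trailing pad iovec. This is the tcp_l2_buf_pad() helper from the patch (quoted in full in David's review further down), reformatted here for readability; the UDP variant, udp_tap_pad(), differs only in the index names:]

  /* Set the trailing pad entry so that Ethernet header, IP header,
   * payload and padding add up to ETH_ZLEN (60 bytes).
   */
  static void tcp_l2_buf_pad(struct iovec *iov)
  {
          size_t l2len = iov[TCP_IOV_ETH].iov_len +
                         iov[TCP_IOV_IP].iov_len +
                         iov[TCP_IOV_PAYLOAD].iov_len;

          if (l2len < ETH_ZLEN)
                  iov[TCP_IOV_ETH_PAD].iov_len = ETH_ZLEN - l2len;
          else
                  iov[TCP_IOV_ETH_PAD].iov_len = 0;
  }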
For both TCP and UDP, we request vhost-user buffers that are large
enough to reach ETH_ZLEN (60 bytes), so padding is just a matter of
increasing the appropriate iov_len and clearing bytes in the buffer
as needed.
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
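[For reference, this is the tcp_vu_pad() helper from the patch, as quoted in the reviews below, shown here with the (char *) cast that Laurent suggests during review already applied; a sketch of the idea rather than the final v2 code:]

  static void tcp_vu_pad(struct iovec *iov, size_t cnt)
  {
          size_t l2len, pad;

          ASSERT(iov_size(iov, cnt) >= sizeof(struct virtio_net_hdr_mrg_rxbuf));
          l2len = iov_size(iov, cnt) - sizeof(struct virtio_net_hdr_mrg_rxbuf);
          if (l2len >= ETH_ZLEN)
                  return;

          pad = ETH_ZLEN - l2len;

          /* Zero the added bytes, then extend the last iovec entry */
          memset((char *)iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);
          iov[cnt - 1].iov_len += pad;
  }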
On Mon, 3 Nov 2025 11:16:08 +0100
Stefano Brivio wrote:
IEEE 802.3 requires a minimum frame payload of 46 bytes, without a 802.1Q tag. Add padding for the simple tap_send_single() case using a zero-filled 60-byte buffer and copying data to it if needed.
In theory, we could add a further element in the iovec array, say:
  uint8_t padding[ETH_ZLEN] = { 0 };
  struct iovec iov[3];
...
  if (l2len < ETH_ZLEN) {
          iov[iovcnt].iov_base = (void *)padding;
          iov[iovcnt].iov_len = ETH_ZLEN - l2len;
          iovcnt++;
  }
and avoid a copy, but that would substantially complicate the vhost-user path, and it's questionable whether passing a reference to a further buffer actually causes lower overhead than the simple copy.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
---
 tap.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tap.c b/tap.c
index bb139d6..8d8d84b 100644
--- a/tap.c
+++ b/tap.c
@@ -131,9 +131,16 @@ unsigned long tap_l2_max_len(const struct ctx *c)
 void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
 {
 	uint32_t vnet_len = htonl(l2len);
+	uint8_t padded[ETH_ZLEN] = { 0 };
 	struct iovec iov[2];
 	size_t iovcnt = 0;
 
+	if (l2len < ETH_ZLEN) {
+		memcpy(padded, data, l2len);
+		data = padded;
+		l2len = ETH_ZLEN;
+	}
Sorry, botched rebase, this should update vnet_len as well, which I missed on the first attempt. I'll fix this in the next spin.
+
 	switch (c->mode) {
 	case MODE_PASST:
 		iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
-- Stefano
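[A minimal sketch of what the fixed hunk could look like, with vnet_len computed only after the padding step so that the length prefix matches what is actually sent; an illustration of the point above, not the actual v2 change:]

  uint8_t padded[ETH_ZLEN] = { 0 };
  uint32_t vnet_len;
  struct iovec iov[2];
  size_t iovcnt = 0;

  if (l2len < ETH_ZLEN) {
          memcpy(padded, data, l2len);
          data = padded;
          l2len = ETH_ZLEN;
  }

  vnet_len = htonl(l2len);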
On Mon, Nov 03, 2025 at 11:16:08AM +0100, Stefano Brivio wrote:
IEEE 802.3 requires a minimum frame payload of 46 bytes, without a 802.1Q tag. Add padding for the simple tap_send_single() case using a zero-filled 60-byte buffer and copying data to it if needed.
In theory, we could add a further element in the iovec array, say:
  uint8_t padding[ETH_ZLEN] = { 0 };
  struct iovec iov[3];
...
  if (l2len < ETH_ZLEN) {
          iov[iovcnt].iov_base = (void *)padding;
          iov[iovcnt].iov_len = ETH_ZLEN - l2len;
          iovcnt++;
  }
and avoid a copy, but that would substantially complicate the vhost-user path, and it's questionable whether passing a reference to a further buffer actually causes lower overhead than the simple copy.
Plus this is essentially a slow path anyway.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
---
 tap.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tap.c b/tap.c
index bb139d6..8d8d84b 100644
--- a/tap.c
+++ b/tap.c
@@ -131,9 +131,16 @@ unsigned long tap_l2_max_len(const struct ctx *c)
 void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
 {
 	uint32_t vnet_len = htonl(l2len);
+	uint8_t padded[ETH_ZLEN] = { 0 };
 	struct iovec iov[2];
 	size_t iovcnt = 0;
 
+	if (l2len < ETH_ZLEN) {
+		memcpy(padded, data, l2len);
+		data = padded;
+		l2len = ETH_ZLEN;
+	}
+
 	switch (c->mode) {
 	case MODE_PASST:
 		iov[iovcnt] = IOV_OF_LVALUE(vnet_len);

-- 
2.43.0
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On Mon, Nov 03, 2025 at 11:16:09AM +0100, Stefano Brivio wrote:
...as I'm going to add a new value to it.
Fixes: 95601237ef82 ("tcp: Replace TCP buffer structure by an iovec array") Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
---
 tcp_internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp_internal.h b/tcp_internal.h
index f55025c..19e8922 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -57,7 +57,7 @@
 #define CONN_V4(conn)	(!!inany_v4(&TAPFLOW(conn)->oaddr))
 #define CONN_V6(conn)	(!CONN_V4(conn))
 
-/*
+/**
  * enum tcp_iov_parts - I/O vector parts for one TCP frame
  * @TCP_IOV_TAP	tap backend specific header
  * @TCP_IOV_ETH	Ethernet header

-- 
2.43.0
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On Mon, Nov 03, 2025 at 11:20:13AM +0100, Stefano Brivio wrote:
On Mon, 3 Nov 2025 11:16:08 +0100 Stefano Brivio wrote:
IEEE 802.3 requires a minimum frame payload of 46 bytes, without a 802.1Q tag. Add padding for the simple tap_send_single() case using a zero-filled 60-byte buffer and copying data to it if needed.
In theory, we could add a further element in the iovec array, say:
  uint8_t padding[ETH_ZLEN] = { 0 };
  struct iovec iov[3];
...
  if (l2len < ETH_ZLEN) {
          iov[iovcnt].iov_base = (void *)padding;
          iov[iovcnt].iov_len = ETH_ZLEN - l2len;
          iovcnt++;
  }
and avoid a copy, but that would substantially complicate the vhost-user path, and it's questionable whether passing a reference to a further buffer actually causes lower overhead than the simple copy.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
--- tap.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/tap.c b/tap.c index bb139d6..8d8d84b 100644 --- a/tap.c +++ b/tap.c @@ -131,9 +131,16 @@ unsigned long tap_l2_max_len(const struct ctx *c) void tap_send_single(const struct ctx *c, const void *data, size_t l2len) { uint32_t vnet_len = htonl(l2len); + uint8_t padded[ETH_ZLEN] = { 0 }; struct iovec iov[2]; size_t iovcnt = 0;
+ if (l2len < ETH_ZLEN) { + memcpy(padded, data, l2len); + data = padded; + l2len = ETH_ZLEN; + }
Sorry, botched rebase, this should update vnet_len as well, which I missed on the first attempt. I'll fix this in the next spin.
Oh... yeah... should have caught that, oops.
+ switch (c->mode) { case MODE_PASST: iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
-- Stefano
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On Mon, Nov 03, 2025 at 11:16:10AM +0100, Stefano Brivio wrote:
...as I'm going to add a new value to it.
Fixes: 3f9bd867b585 ("udp: Split tap-bound UDP packets into multiple buffers using io vector") Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
--- udp.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/udp.c b/udp.c index 9c00950..0e37e77 100644 --- a/udp.c +++ b/udp.c @@ -164,11 +164,11 @@ udp_meta[UDP_MAX_FRAMES];
/** * enum udp_iov_idx - Indices for the buffers making up a single UDP frame - * @UDP_IOV_TAP tap specific header - * @UDP_IOV_ETH Ethernet header - * @UDP_IOV_IP IP (v4/v6) header - * @UDP_IOV_PAYLOAD IP payload (UDP header + data) - * @UDP_NUM_IOVS the number of entries in the iovec array + * @UDP_IOV_TAP tap specific header + * @UDP_IOV_ETH Ethernet header + * @UDP_IOV_IP IP (v4/v6) header + * @UDP_IOV_PAYLOAD IP payload (UDP header + data) + * @UDP_NUM_IOVS the number of entries in the iovec array */ enum udp_iov_idx { UDP_IOV_TAP, -- 2.43.0
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On Mon, Nov 03, 2025 at 11:16:11AM +0100, Stefano Brivio wrote:
Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and UDP_IOV_ETH_PAD for UDP, after the payload, make that point to a zero-filled buffer, and send out a part of it if needed to reach the minimum frame length given by 802.3, that is, 60 bytes altogether.
The frames we might need to pad are IPv4 only (the IPv6 header is larger), and are typically TCP ACK segments but can also be small data segments or datagrams.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
---
 tcp_buf.c      | 23 +++++++++++++++++++++++
 tcp_internal.h |  2 ++
 udp.c          | 21 +++++++++++++++++++++
 util.c         |  3 +++
 util.h         |  3 +++
 5 files changed, 52 insertions(+)

diff --git a/tcp_buf.c b/tcp_buf.c
index 2058225..5d419d3 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -96,6 +96,7 @@ void tcp_sock_iov_init(const struct ctx *c)
 		iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp_payload_tap_hdr[i]);
 		iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
 		iov[TCP_IOV_PAYLOAD].iov_base = &tcp_payload[i];
+		iov[TCP_IOV_ETH_PAD].iov_base = eth_pad;
 	}
 }
 
@@ -144,6 +145,22 @@ void tcp_payload_flush(const struct ctx *c)
 	tcp_payload_used = 0;
 }
 
+/**
+ * tcp_l2_buf_pad() - Calculate padding to send out of padding (zero) buffer
+ * @iov:	Pointer to iovec of frame parts we're about to send
+ */
+static void tcp_l2_buf_pad(struct iovec *iov)
+{
+	size_t l2len = iov[TCP_IOV_ETH].iov_len +
+		       iov[TCP_IOV_IP].iov_len +
+		       iov[TCP_IOV_PAYLOAD].iov_len;
+
+	if (l2len < ETH_ZLEN)
+		iov[TCP_IOV_ETH_PAD].iov_len = ETH_ZLEN - l2len;
+	else
+		iov[TCP_IOV_ETH_PAD].iov_len = 0;
+}
+
 /**
  * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
  * @c:		Execution context
@@ -212,6 +229,8 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 	tcp_l2_buf_fill_headers(c, conn, iov, NULL, seq, false);
 
+	tcp_l2_buf_pad(iov);
+
 	if (flags & DUP_ACK) {
 		struct iovec *dup_iov = tcp_l2_iov[tcp_payload_used];
 		tcp_frame_conns[tcp_payload_used++] = conn;
@@ -223,6 +242,7 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		memcpy(dup_iov[TCP_IOV_PAYLOAD].iov_base,
 		       iov[TCP_IOV_PAYLOAD].iov_base, l4len);
 		dup_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
+		dup_iov[TCP_IOV_ETH_PAD].iov_len = iov[TCP_IOV_ETH_PAD].iov_len;
 	}
 
 	if (tcp_payload_used > TCP_FRAMES_MEM - 2)
@@ -270,6 +290,9 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 	payload->th.psh = push;
 	iov[TCP_IOV_PAYLOAD].iov_len = dlen + sizeof(struct tcphdr);
 	tcp_l2_buf_fill_headers(c, conn, iov, check, seq, false);
+
+	tcp_l2_buf_pad(iov);
+
 	if (++tcp_payload_used > TCP_FRAMES_MEM - 1)
 		tcp_payload_flush(c);
 }
diff --git a/tcp_internal.h b/tcp_internal.h
index 19e8922..5f8fb35 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -63,6 +63,7 @@
  * @TCP_IOV_ETH	Ethernet header
  * @TCP_IOV_IP	IP (v4/v6) header
  * @TCP_IOV_PAYLOAD	IP payload (TCP header + data)
+ * @TCP_IOV_ETH_PAD	Ethernet (802.3) padding to 60 bytes
  * @TCP_NUM_IOVS	the number of entries in the iovec array
  */
 enum tcp_iov_parts {
@@ -70,6 +71,7 @@ enum tcp_iov_parts {
 	TCP_IOV_ETH = 1,
 	TCP_IOV_IP = 2,
 	TCP_IOV_PAYLOAD = 3,
+	TCP_IOV_ETH_PAD = 4,
 	TCP_NUM_IOVS
 };
 
diff --git a/udp.c b/udp.c
index 0e37e77..e7eb825 100644
--- a/udp.c
+++ b/udp.c
@@ -168,6 +168,7 @@ udp_meta[UDP_MAX_FRAMES];
  * @UDP_IOV_ETH	Ethernet header
  * @UDP_IOV_IP	IP (v4/v6) header
  * @UDP_IOV_PAYLOAD	IP payload (UDP header + data)
+ * @UDP_IOV_ETH_PAD	Ethernet (802.3) padding to 60 bytes
  * @UDP_NUM_IOVS	the number of entries in the iovec array
  */
 enum udp_iov_idx {
@@ -175,6 +176,7 @@ enum udp_iov_idx {
 	UDP_IOV_ETH,
 	UDP_IOV_IP,
 	UDP_IOV_PAYLOAD,
+	UDP_IOV_ETH_PAD,
 	UDP_NUM_IOVS,
 };
 
@@ -239,6 +241,7 @@ static void udp_iov_init_one(const struct ctx *c, size_t i)
 	tiov[UDP_IOV_ETH] = IOV_OF_LVALUE(udp_eth_hdr[i]);
 	tiov[UDP_IOV_TAP] = tap_hdr_iov(c, &meta->taph);
 	tiov[UDP_IOV_PAYLOAD].iov_base = payload;
+	tiov[UDP_IOV_ETH_PAD].iov_base = eth_pad;
 
 	mh->msg_iov = siov;
 	mh->msg_iovlen = 1;
@@ -344,6 +347,22 @@ size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
 	return l4len;
 }
 
+/**
+ * udp_tap_pad() - Calculate padding to send out of padding (zero) buffer
+ * @iov:	Pointer to iovec of frame parts we're about to send
+ */
+static void udp_tap_pad(struct iovec *iov)
+{
+	size_t l2len = iov[UDP_IOV_ETH].iov_len +
+		       iov[UDP_IOV_IP].iov_len +
+		       iov[UDP_IOV_PAYLOAD].iov_len;
+
+	if (l2len < ETH_ZLEN)
+		iov[UDP_IOV_ETH_PAD].iov_len = ETH_ZLEN - l2len;
+	else
+		iov[UDP_IOV_ETH_PAD].iov_len = 0;
+}
+
 /**
  * udp_tap_prepare() - Convert one datagram into a tap frame
  * @mmh:	Receiving mmsghdr array
@@ -379,6 +398,8 @@ static void udp_tap_prepare(const struct mmsghdr *mmh,
 		(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip4h);
 	}
 	(*tap_iov)[UDP_IOV_PAYLOAD].iov_len = l4len;
+
+	udp_tap_pad(*tap_iov);
 }
 
 /**
diff --git a/util.c b/util.c
index 44c21a3..3d82093 100644
--- a/util.c
+++ b/util.c
@@ -39,6 +39,9 @@
 #include
 #endif
 
+/* Zero-filled buffer to pad 802.3 frames, up to 60 (ETH_ZLEN) bytes */
+uint8_t eth_pad[ETH_ZLEN] = { 0 };
+
 /**
  * sock_l4_sa() - Create and bind socket to socket address, add to epoll list
  * @c:		Execution context
diff --git a/util.h b/util.h
index a0b2ada..dc7ebcf 100644
--- a/util.h
+++ b/util.h
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 
 #include "log.h"
 
@@ -152,6 +153,8 @@ void abort_with_msg(const char *fmt, ...)
 #define ntohll(x)	(be64toh((x)))
 #define htonll(x)	(htobe64((x)))
 
+extern uint8_t eth_pad[ETH_ZLEN];
+
 /**
  * ntohl_unaligned() - Read 32-bit BE value from a possibly unaligned address
  * @p:	Pointer to the BE value in memory

-- 
2.43.0
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On 11/3/25 11:16, Stefano Brivio wrote:
...as I'm going to add a new value to it.
Fixes: 95601237ef82 ("tcp: Replace TCP buffer structure by an iovec array") Signed-off-by: Stefano Brivio
---
 tcp_internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp_internal.h b/tcp_internal.h
index f55025c..19e8922 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -57,7 +57,7 @@
 #define CONN_V4(conn)	(!!inany_v4(&TAPFLOW(conn)->oaddr))
 #define CONN_V6(conn)	(!CONN_V4(conn))
 
-/*
+/**
  * enum tcp_iov_parts - I/O vector parts for one TCP frame
  * @TCP_IOV_TAP	tap backend specific header
  * @TCP_IOV_ETH	Ethernet header
Reviewed-by: Laurent Vivier
On 11/3/25 11:16, Stefano Brivio wrote:
...as I'm going to add a new value to it.
Fixes: 3f9bd867b585 ("udp: Split tap-bound UDP packets into multiple buffers using io vector") Signed-off-by: Stefano Brivio
--- udp.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/udp.c b/udp.c index 9c00950..0e37e77 100644 --- a/udp.c +++ b/udp.c @@ -164,11 +164,11 @@ udp_meta[UDP_MAX_FRAMES];
/** * enum udp_iov_idx - Indices for the buffers making up a single UDP frame - * @UDP_IOV_TAP tap specific header - * @UDP_IOV_ETH Ethernet header - * @UDP_IOV_IP IP (v4/v6) header - * @UDP_IOV_PAYLOAD IP payload (UDP header + data) - * @UDP_NUM_IOVS the number of entries in the iovec array + * @UDP_IOV_TAP tap specific header + * @UDP_IOV_ETH Ethernet header + * @UDP_IOV_IP IP (v4/v6) header + * @UDP_IOV_PAYLOAD IP payload (UDP header + data) + * @UDP_NUM_IOVS the number of entries in the iovec array */ enum udp_iov_idx { UDP_IOV_TAP,
Reviewed-by: Laurent Vivier
On 11/3/25 11:16, Stefano Brivio wrote:
IEEE 802.3 requires a minimum frame payload of 46 bytes, without a 802.1Q tag. Add padding for the simple tap_send_single() case using a zero-filled 60-byte buffer and copying data to it if needed.
In theory, we could add a further element in the iovec array, say:
  uint8_t padding[ETH_ZLEN] = { 0 };
  struct iovec iov[3];
...
  if (l2len < ETH_ZLEN) {
          iov[iovcnt].iov_base = (void *)padding;
          iov[iovcnt].iov_len = ETH_ZLEN - l2len;
          iovcnt++;
  }
and avoid a copy, but that would substantially complicate the vhost-user path, and it's questionable whether passing a reference to a further buffer actually causes lower overhead than the simple copy.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
---
 tap.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tap.c b/tap.c
index bb139d6..8d8d84b 100644
--- a/tap.c
+++ b/tap.c
@@ -131,9 +131,16 @@ unsigned long tap_l2_max_len(const struct ctx *c)
 void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
 {
 	uint32_t vnet_len = htonl(l2len);
+	uint8_t padded[ETH_ZLEN] = { 0 };
 	struct iovec iov[2];
 	size_t iovcnt = 0;
 
+	if (l2len < ETH_ZLEN) {
+		memcpy(padded, data, l2len);
+		data = padded;
+		l2len = ETH_ZLEN;
+	}
+
 	switch (c->mode) {
 	case MODE_PASST:
 		iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
with vnet_len updated:
Reviewed-by: Laurent Vivier
On 11/3/25 11:16, Stefano Brivio wrote:
Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and UDP_IOV_ETH_PAD for UDP, after the payload, make that point to a zero-filled buffer, and send out a part of it if needed to reach the minimum frame length given by 802.3, that is, 60 bytes altogether.
The frames we might need to pad are IPv4 only (the IPv6 header is larger), and are typically TCP ACK segments but can also be small data segments or datagrams.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
--- tcp_buf.c | 23 +++++++++++++++++++++++ tcp_internal.h | 2 ++ udp.c | 21 +++++++++++++++++++++ util.c | 3 +++ util.h | 3 +++ 5 files changed, 52 insertions(+)
Reviewed-by: Laurent Vivier
On 11/3/25 11:16, Stefano Brivio wrote:
For both TCP and UDP, we request vhost-user buffers that are large enough to reach ETH_ZLEN (60 bytes), so padding is just a matter of increasing the appropriate iov_len and clearing bytes in the buffer as needed.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
---
 tcp.c          |  2 --
 tcp_internal.h |  1 +
 tcp_vu.c       | 27 +++++++++++++++++++++++++++
 udp_vu.c       | 11 ++++++++++-
 4 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/tcp.c b/tcp.c
index e91c0cf..039688d 100644
--- a/tcp.c
+++ b/tcp.c
@@ -335,8 +335,6 @@ enum {
 };
 #endif
 
-/* MSS rounding: see SET_MSS() */
-#define MSS_DEFAULT		536
 #define WINDOW_DEFAULT		14600		/* RFC 6928 */
 
 #define ACK_INTERVAL		10	/* ms */
diff --git a/tcp_internal.h b/tcp_internal.h
index 5f8fb35..d2295c9 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -12,6 +12,7 @@
 #define BUF_DISCARD_SIZE	(1 << 20)
 #define DISCARD_IOV_NUM		DIV_ROUND_UP(MAX_WINDOW, BUF_DISCARD_SIZE)
 
+#define MSS_DEFAULT /* and minimum */ 536 /* as it comes from minimum MTU */
 #define MSS4	ROUND_DOWN(IP_MAX_MTU -			\
 			   sizeof(struct tcphdr) -	\
 			   sizeof(struct iphdr),	\
diff --git a/tcp_vu.c b/tcp_vu.c
index 1c81ce3..7239401 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -60,6 +60,29 @@ static size_t tcp_vu_hdrlen(bool v6)
 	return hdrlen;
 }
 
+/**
+ * tcp_vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed
+ * @iov:	iovec array storing 802.3 frame with TCP segment inside
+ * @cnt:	Number of entries in @iov
+ */
+static void tcp_vu_pad(struct iovec *iov, size_t cnt)
+{
+	size_t l2len, pad;
+
+	ASSERT(iov_size(iov, cnt) >= sizeof(struct virtio_net_hdr_mrg_rxbuf));
+	l2len = iov_size(iov, cnt) - sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	if (l2len >= ETH_ZLEN)
+		return;
+
+	pad = ETH_ZLEN - l2len;
+
+	/* tcp_vu_sock_recv() requests at least MSS-sized vhost-user buffers */
+	static_assert(ETH_ZLEN <= MSS_DEFAULT);
+
+	memset(&iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);
I think it should be

  memset((char *)iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);
+	iov[cnt - 1].iov_len += pad;
+}
+
 /**
  * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
  * @c:		Execution context
@@ -138,6 +161,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	tcp_fill_headers(c, conn, NULL, eh, ip4h, ip6h, th, &payload, NULL,
 			 seq, !*c->pcap);
 
+	tcp_vu_pad(&flags_elem[0].in_sg[0], 1);
+
 	if (*c->pcap) {
 		pcap_iov(&flags_elem[0].in_sg[0], 1,
 			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
@@ -456,6 +481,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 
 	tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
 
+	tcp_vu_pad(iov, buf_cnt);
+
 	if (*c->pcap) {
 		pcap_iov(iov, buf_cnt,
 			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
diff --git a/udp_vu.c b/udp_vu.c
index 099677f..1b60860 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -72,8 +72,8 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s,
 {
 	const struct vu_dev *vdev = c->vdev;
 	int iov_cnt, idx, iov_used;
+	size_t off, hdrlen, l2len;
 	struct msghdr msg = { 0 };
-	size_t off, hdrlen;
 
 	ASSERT(!c->no_udp);
 
@@ -116,6 +116,15 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s,
 	iov_vu[idx].iov_len = off;
 	iov_used = idx + !!off;
 
+	/* pad 802.3 frame to 60 bytes if needed */
+	l2len = *dlen + hdrlen - sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	if (l2len < ETH_ZLEN) {
+		size_t pad = ETH_ZLEN - l2len;
+
+		iov_vu[idx].iov_len += pad;
+		memset(&iov_vu[idx].iov_base + off, 0, pad);
It should be:

  memset((char *)iov_vu[idx].iov_base + off, 0, pad);

Perhaps clearer to write (off is not obvious, we need to check above):

  memset((char *)iov_vu[idx].iov_base + iov_vu[idx].iov_len, 0, pad);
  iov_vu[idx].iov_len += pad;
+	}
+
 	vu_set_vnethdr(vdev, iov_vu[0].iov_base, iov_used);
 
 	/* release unused buffers */

Thanks,
Laurent
On Tue, 4 Nov 2025 16:50:43 +0100
Laurent Vivier wrote:
On 11/3/25 11:16, Stefano Brivio wrote:
For both TCP and UDP, we request vhost-user buffers that are large enough to reach ETH_ZLEN (60 bytes), so padding is just a matter of increasing the appropriate iov_len and clearing bytes in the buffer as needed.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
--- tcp.c | 2 -- tcp_internal.h | 1 + tcp_vu.c | 27 +++++++++++++++++++++++++++ udp_vu.c | 11 ++++++++++- 4 files changed, 38 insertions(+), 3 deletions(-) diff --git a/tcp.c b/tcp.c index e91c0cf..039688d 100644 --- a/tcp.c +++ b/tcp.c @@ -335,8 +335,6 @@ enum { }; #endif
-/* MSS rounding: see SET_MSS() */ -#define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ diff --git a/tcp_internal.h b/tcp_internal.h index 5f8fb35..d2295c9 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -12,6 +12,7 @@ #define BUF_DISCARD_SIZE (1 << 20) #define DISCARD_IOV_NUM DIV_ROUND_UP(MAX_WINDOW, BUF_DISCARD_SIZE)
+#define MSS_DEFAULT /* and minimum */ 536 /* as it comes from minimum MTU */ #define MSS4 ROUND_DOWN(IP_MAX_MTU - \ sizeof(struct tcphdr) - \ sizeof(struct iphdr), \ diff --git a/tcp_vu.c b/tcp_vu.c index 1c81ce3..7239401 100644 --- a/tcp_vu.c +++ b/tcp_vu.c @@ -60,6 +60,29 @@ static size_t tcp_vu_hdrlen(bool v6) return hdrlen; }
+/** + * tcp_vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed + * @iov: iovec array storing 802.3 frame with TCP segment inside + * @cnt: Number of entries in @iov + */ +static void tcp_vu_pad(struct iovec *iov, size_t cnt) +{ + size_t l2len, pad; + + ASSERT(iov_size(iov, cnt) >= sizeof(struct virtio_net_hdr_mrg_rxbuf)); + l2len = iov_size(iov, cnt) - sizeof(struct virtio_net_hdr_mrg_rxbuf); + if (l2len >= ETH_ZLEN) + return; + + pad = ETH_ZLEN - l2len; + + /* tcp_vu_sock_recv() requests at least MSS-sized vhost-user buffers */ + static_assert(ETH_ZLEN <= MSS_DEFAULT); + + memset(&iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);
I think it should be
memset((char *)iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);
Right, thanks, I always forget that sizeof(void) being 1 is a gcc extension:

  https://gcc.gnu.org/onlinedocs/gcc/Pointer-Arith.html

What's rather confusing, actually, is that even if I explicitly enable -Wpointer-arith, I don't get a warning for that. Any clue?
+ iov[cnt - 1].iov_len += pad; +} + /** * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload) * @c: Execution context @@ -138,6 +161,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) tcp_fill_headers(c, conn, NULL, eh, ip4h, ip6h, th, &payload, NULL, seq, !*c->pcap);
+ tcp_vu_pad(&flags_elem[0].in_sg[0], 1); + if (*c->pcap) { pcap_iov(&flags_elem[0].in_sg[0], 1, sizeof(struct virtio_net_hdr_mrg_rxbuf)); @@ -456,6 +481,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
+ tcp_vu_pad(iov, buf_cnt); + if (*c->pcap) { pcap_iov(iov, buf_cnt, sizeof(struct virtio_net_hdr_mrg_rxbuf)); diff --git a/udp_vu.c b/udp_vu.c index 099677f..1b60860 100644 --- a/udp_vu.c +++ b/udp_vu.c @@ -72,8 +72,8 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s, { const struct vu_dev *vdev = c->vdev; int iov_cnt, idx, iov_used; + size_t off, hdrlen, l2len; struct msghdr msg = { 0 }; - size_t off, hdrlen;
ASSERT(!c->no_udp);
@@ -116,6 +116,15 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s, iov_vu[idx].iov_len = off; iov_used = idx + !!off;
+ /* pad 802.3 frame to 60 bytes if needed */ + l2len = *dlen + hdrlen - sizeof(struct virtio_net_hdr_mrg_rxbuf); + if (l2len < ETH_ZLEN) { + size_t pad = ETH_ZLEN - l2len; + + iov_vu[idx].iov_len += pad; + memset(&iov_vu[idx].iov_base + off, 0, pad);
It should be:
memset((char *)iov_vu[idx].iov_base + off, 0, pad);
Fixing as well.
Perhaps clearer to write (off is not obvious, we need to check above):
  memset((char *)iov_vu[idx].iov_base + iov_vu[idx].iov_len, 0, pad);
  iov_vu[idx].iov_len += pad;
Definitely. I'll change this in v2.

-- 
Stefano
On 11/4/25 17:09, Stefano Brivio wrote:
On Tue, 4 Nov 2025 16:50:43 +0100 Laurent Vivier wrote:
On 11/3/25 11:16, Stefano Brivio wrote:
For both TCP and UDP, we request vhost-user buffers that are large enough to reach ETH_ZLEN (60 bytes), so padding is just a matter of increasing the appropriate iov_len and clearing bytes in the buffer as needed.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
--- tcp.c | 2 -- tcp_internal.h | 1 + tcp_vu.c | 27 +++++++++++++++++++++++++++ udp_vu.c | 11 ++++++++++- 4 files changed, 38 insertions(+), 3 deletions(-) diff --git a/tcp.c b/tcp.c index e91c0cf..039688d 100644 --- a/tcp.c +++ b/tcp.c @@ -335,8 +335,6 @@ enum { }; #endif
-/* MSS rounding: see SET_MSS() */ -#define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ diff --git a/tcp_internal.h b/tcp_internal.h index 5f8fb35..d2295c9 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -12,6 +12,7 @@ #define BUF_DISCARD_SIZE (1 << 20) #define DISCARD_IOV_NUM DIV_ROUND_UP(MAX_WINDOW, BUF_DISCARD_SIZE)
+#define MSS_DEFAULT /* and minimum */ 536 /* as it comes from minimum MTU */ #define MSS4 ROUND_DOWN(IP_MAX_MTU - \ sizeof(struct tcphdr) - \ sizeof(struct iphdr), \ diff --git a/tcp_vu.c b/tcp_vu.c index 1c81ce3..7239401 100644 --- a/tcp_vu.c +++ b/tcp_vu.c @@ -60,6 +60,29 @@ static size_t tcp_vu_hdrlen(bool v6) return hdrlen; }
+/** + * tcp_vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed + * @iov: iovec array storing 802.3 frame with TCP segment inside + * @cnt: Number of entries in @iov + */ +static void tcp_vu_pad(struct iovec *iov, size_t cnt) +{ + size_t l2len, pad; + + ASSERT(iov_size(iov, cnt) >= sizeof(struct virtio_net_hdr_mrg_rxbuf)); + l2len = iov_size(iov, cnt) - sizeof(struct virtio_net_hdr_mrg_rxbuf); + if (l2len >= ETH_ZLEN) + return; + + pad = ETH_ZLEN - l2len; + + /* tcp_vu_sock_recv() requests at least MSS-sized vhost-user buffers */ + static_assert(ETH_ZLEN <= MSS_DEFAULT); + + memset(&iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);
I think it should be
memset((char *)iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);
Right, thanks, I always forget that sizeof(void) being 1 is a gcc extension:
https://gcc.gnu.org/onlinedocs/gcc/Pointer-Arith.html
What's rather confusing, actually, is that even if I explicitly enable -Wpointer-arith, I don't get a warning for that. Any clue?
In fact the arithmetic is done on a (void **): iov[cnt - 1].iov_base is a (void *), and you take &(void *), so the addition is scaled by sizeof(void *) instead of advancing byte by byte into the buffer.

Normally gcc spits out warnings when we do .iov_base + .iov_len, but as you take the address of iov_base, all is fine :P

Thanks,
Laurent
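[To spell out the difference, an illustration only, not code from the patch:]

  /* Buggy: &iov[cnt - 1].iov_base has type (void **), so adding iov_len
   * advances by iov_len * sizeof(void *) bytes, starting from the address
   * of the iov_base field itself rather than from the buffer it points to.
   */
  memset(&iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);

  /* Intended: cast the buffer pointer to (char *), so adding iov_len
   * advances iov_len bytes into the buffer.
   */
  memset((char *)iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);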
On Tue, 4 Nov 2025 17:28:52 +0100
Laurent Vivier wrote:
On 11/4/25 17:09, Stefano Brivio wrote:
On Tue, 4 Nov 2025 16:50:43 +0100 Laurent Vivier wrote:
On 11/3/25 11:16, Stefano Brivio wrote:
For both TCP and UDP, we request vhost-user buffers that are large enough to reach ETH_ZLEN (60 bytes), so padding is just a matter of increasing the appropriate iov_len and clearing bytes in the buffer as needed.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
--- tcp.c | 2 -- tcp_internal.h | 1 + tcp_vu.c | 27 +++++++++++++++++++++++++++ udp_vu.c | 11 ++++++++++- 4 files changed, 38 insertions(+), 3 deletions(-) diff --git a/tcp.c b/tcp.c index e91c0cf..039688d 100644 --- a/tcp.c +++ b/tcp.c @@ -335,8 +335,6 @@ enum { }; #endif
-/* MSS rounding: see SET_MSS() */ -#define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ diff --git a/tcp_internal.h b/tcp_internal.h index 5f8fb35..d2295c9 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -12,6 +12,7 @@ #define BUF_DISCARD_SIZE (1 << 20) #define DISCARD_IOV_NUM DIV_ROUND_UP(MAX_WINDOW, BUF_DISCARD_SIZE)
+#define MSS_DEFAULT /* and minimum */ 536 /* as it comes from minimum MTU */ #define MSS4 ROUND_DOWN(IP_MAX_MTU - \ sizeof(struct tcphdr) - \ sizeof(struct iphdr), \ diff --git a/tcp_vu.c b/tcp_vu.c index 1c81ce3..7239401 100644 --- a/tcp_vu.c +++ b/tcp_vu.c @@ -60,6 +60,29 @@ static size_t tcp_vu_hdrlen(bool v6) return hdrlen; }
+/** + * tcp_vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed + * @iov: iovec array storing 802.3 frame with TCP segment inside + * @cnt: Number of entries in @iov + */ +static void tcp_vu_pad(struct iovec *iov, size_t cnt) +{ + size_t l2len, pad; + + ASSERT(iov_size(iov, cnt) >= sizeof(struct virtio_net_hdr_mrg_rxbuf)); + l2len = iov_size(iov, cnt) - sizeof(struct virtio_net_hdr_mrg_rxbuf); + if (l2len >= ETH_ZLEN) + return; + + pad = ETH_ZLEN - l2len; + + /* tcp_vu_sock_recv() requests at least MSS-sized vhost-user buffers */ + static_assert(ETH_ZLEN <= MSS_DEFAULT); + + memset(&iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);
I think it should be
memset((char *)iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);
Right, thanks, I always forget that sizeof(void) being 1 is a gcc extension:
https://gcc.gnu.org/onlinedocs/gcc/Pointer-Arith.html
What's rather confusing, actually, is that even if I explicitly enable -Wpointer-arith, I don't get a warning for that. Any clue?
in fact it's sizeof(void **) as iov[cnt - 1].iov_base is (void *) and you take &(void *).
Normally gcc spits out warnings when we do .iov_base + .iov_len, but as you take address of iov_base all is fine :P
Ouch, wow, how does it even work? I checked captures of most functional tests with vhost-user, connections worked and the right ACK segments had the right amount of padding.

I guess it's just that there was nothing fundamental at the resulting address and we already happened to have zeroes in the buffers.

Thanks for spotting that, I'll fix in v2.

-- 
Stefano
On Mon, Nov 03, 2025 at 11:16:12AM +0100, Stefano Brivio wrote:
For both TCP and UDP, we request vhost-user buffers that are large enough to reach ETH_ZLEN (60 bytes), so padding is just a matter of increasing the appropriate iov_len and clearing bytes in the buffer as needed.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
I think this is correct, apart from the nasty bug Laurent spotted. I'm less certain if this is the most natural way to do it.
--- tcp.c | 2 -- tcp_internal.h | 1 + tcp_vu.c | 27 +++++++++++++++++++++++++++ udp_vu.c | 11 ++++++++++- 4 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/tcp.c b/tcp.c index e91c0cf..039688d 100644 --- a/tcp.c +++ b/tcp.c @@ -335,8 +335,6 @@ enum { }; #endif
-/* MSS rounding: see SET_MSS() */ -#define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ diff --git a/tcp_internal.h b/tcp_internal.h index 5f8fb35..d2295c9 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -12,6 +12,7 @@ #define BUF_DISCARD_SIZE (1 << 20) #define DISCARD_IOV_NUM DIV_ROUND_UP(MAX_WINDOW, BUF_DISCARD_SIZE)
+#define MSS_DEFAULT /* and minimum */ 536 /* as it comes from minimum MTU */ #define MSS4 ROUND_DOWN(IP_MAX_MTU - \ sizeof(struct tcphdr) - \ sizeof(struct iphdr), \ diff --git a/tcp_vu.c b/tcp_vu.c index 1c81ce3..7239401 100644 --- a/tcp_vu.c +++ b/tcp_vu.c @@ -60,6 +60,29 @@ static size_t tcp_vu_hdrlen(bool v6) return hdrlen; }
+/** + * tcp_vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed + * @iov: iovec array storing 802.3 frame with TCP segment inside + * @cnt: Number of entries in @iov + */ +static void tcp_vu_pad(struct iovec *iov, size_t cnt) +{ + size_t l2len, pad; + + ASSERT(iov_size(iov, cnt) >= sizeof(struct virtio_net_hdr_mrg_rxbuf)); + l2len = iov_size(iov, cnt) - sizeof(struct virtio_net_hdr_mrg_rxbuf);
Re-obtaining l2len from iov_size() seems kind of awkward, since the callers should already know the length - they've just used it to populate iov_len.
+ if (l2len >= ETH_ZLEN) + return;
+ + pad = ETH_ZLEN - l2len; + + /* tcp_vu_sock_recv() requests at least MSS-sized vhost-user buffers */ + static_assert(ETH_ZLEN <= MSS_DEFAULT);
So, this is true for the data path, but not AFAICT for the flags path. There _is_ still enough space in this case, because we request space for (tcp_vu_hdrlen() + sizeof(struct tcp_syn_opts)) which works out to:

  ETH_HLEN          14
  + IP header       20
  + TCP header      20
  + tcp_syn_opts     8
  ----
                    62 > ETH_ZLEN

But the comment and assert are misleading. It seems like it would make more sense to clamp ETH_ZLEN as a lower length bound before we vu_collect() the buffers. Or indeed, like we should be calculating l2len already including the clamping.
+ + memset(&iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad); + iov[cnt - 1].iov_len += pad; +} + /** * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload) * @c: Execution context @@ -138,6 +161,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) tcp_fill_headers(c, conn, NULL, eh, ip4h, ip6h, th, &payload, NULL, seq, !*c->pcap);
+ tcp_vu_pad(&flags_elem[0].in_sg[0], 1); + if (*c->pcap) { pcap_iov(&flags_elem[0].in_sg[0], 1, sizeof(struct virtio_net_hdr_mrg_rxbuf)); @@ -456,6 +481,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
+ tcp_vu_pad(iov, buf_cnt); + if (*c->pcap) { pcap_iov(iov, buf_cnt, sizeof(struct virtio_net_hdr_mrg_rxbuf)); diff --git a/udp_vu.c b/udp_vu.c index 099677f..1b60860 100644 --- a/udp_vu.c +++ b/udp_vu.c @@ -72,8 +72,8 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s, { const struct vu_dev *vdev = c->vdev; int iov_cnt, idx, iov_used; + size_t off, hdrlen, l2len; struct msghdr msg = { 0 }; - size_t off, hdrlen;
ASSERT(!c->no_udp);
@@ -116,6 +116,15 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s, iov_vu[idx].iov_len = off; iov_used = idx + !!off;
+ /* pad 802.3 frame to 60 bytes if needed */ + l2len = *dlen + hdrlen - sizeof(struct virtio_net_hdr_mrg_rxbuf); + if (l2len < ETH_ZLEN) { + size_t pad = ETH_ZLEN - l2len; + + iov_vu[idx].iov_len += pad; + memset(&iov_vu[idx].iov_base + off, 0, pad); + } + vu_set_vnethdr(vdev, iov_vu[0].iov_base, iov_used);
/* release unused buffers */ -- 2.43.0
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On Wed, 5 Nov 2025 14:49:59 +1100
David Gibson wrote:
On Mon, Nov 03, 2025 at 11:16:12AM +0100, Stefano Brivio wrote:
For both TCP and UDP, we request vhost-user buffers that are large enough to reach ETH_ZLEN (60 bytes), so padding is just a matter of increasing the appropriate iov_len and clearing bytes in the buffer as needed.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
I think this is correct, apart from the nasty bug Laurent spotted.
I'm less certain if this is the most natural way to do it.
--- tcp.c | 2 -- tcp_internal.h | 1 + tcp_vu.c | 27 +++++++++++++++++++++++++++ udp_vu.c | 11 ++++++++++- 4 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/tcp.c b/tcp.c index e91c0cf..039688d 100644 --- a/tcp.c +++ b/tcp.c @@ -335,8 +335,6 @@ enum { }; #endif
-/* MSS rounding: see SET_MSS() */ -#define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ diff --git a/tcp_internal.h b/tcp_internal.h index 5f8fb35..d2295c9 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -12,6 +12,7 @@ #define BUF_DISCARD_SIZE (1 << 20) #define DISCARD_IOV_NUM DIV_ROUND_UP(MAX_WINDOW, BUF_DISCARD_SIZE)
+#define MSS_DEFAULT /* and minimum */ 536 /* as it comes from minimum MTU */ #define MSS4 ROUND_DOWN(IP_MAX_MTU - \ sizeof(struct tcphdr) - \ sizeof(struct iphdr), \ diff --git a/tcp_vu.c b/tcp_vu.c index 1c81ce3..7239401 100644 --- a/tcp_vu.c +++ b/tcp_vu.c @@ -60,6 +60,29 @@ static size_t tcp_vu_hdrlen(bool v6) return hdrlen; }
+/** + * tcp_vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed + * @iov: iovec array storing 802.3 frame with TCP segment inside + * @cnt: Number of entries in @iov + */ +static void tcp_vu_pad(struct iovec *iov, size_t cnt) +{ + size_t l2len, pad; + + ASSERT(iov_size(iov, cnt) >= sizeof(struct virtio_net_hdr_mrg_rxbuf)); + l2len = iov_size(iov, cnt) - sizeof(struct virtio_net_hdr_mrg_rxbuf);
Re-obtaining l2len from iov_size() seems kind of awkward, since the callers should already know the length - they've just used it to populate iov_len.
That's only the case for tcp_vu_send_flag() though, because tcp_vu_data_from_sock() can use split buffers and iov_len of the first element is not the same as the whole frame length. That is, you could (very much in theory) have iov_len set to 50 for the first iov item, set to 4 for the second iov item, and the frame needs padding, but you can't tell from the first iov item itself.
+ if (l2len >= ETH_ZLEN) + return;
+ + pad = ETH_ZLEN - l2len; + + /* tcp_vu_sock_recv() requests at least MSS-sized vhost-user buffers */ + static_assert(ETH_ZLEN <= MSS_DEFAULT);
So, this is true for the data path, but not AFAICT for the flags path.
There _is_ still enough space in this case, because we request space for (tcp_vu_hdrlen() + sizeof(struct tcp_syn_opts)) which works out to:

  ETH_HLEN          14
  + IP header       20
  + TCP header      20
  + tcp_syn_opts     8
  ----
                    62 > ETH_ZLEN
But the comment and assert are misleading.
Dropped, in favour of:
It seems like it would make more sense to clamp ETH_ZLEN as a lower length bound before we vu_collect() the buffers.
this.
Or indeed, like we should be calculating l2len already including the clamping.
That's not trivial to do for the data path, I think (see above). I think it would be doable with a rework of the tcp_vu_data_from_sock() loop but I'd say it's beyond the scope of this series.
+ memset(&iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad); + iov[cnt - 1].iov_len += pad; +} + /** * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload) * @c: Execution context @@ -138,6 +161,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) tcp_fill_headers(c, conn, NULL, eh, ip4h, ip6h, th, &payload, NULL, seq, !*c->pcap);
+ tcp_vu_pad(&flags_elem[0].in_sg[0], 1); + if (*c->pcap) { pcap_iov(&flags_elem[0].in_sg[0], 1, sizeof(struct virtio_net_hdr_mrg_rxbuf)); @@ -456,6 +481,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
+ tcp_vu_pad(iov, buf_cnt); + if (*c->pcap) { pcap_iov(iov, buf_cnt, sizeof(struct virtio_net_hdr_mrg_rxbuf)); diff --git a/udp_vu.c b/udp_vu.c index 099677f..1b60860 100644 --- a/udp_vu.c +++ b/udp_vu.c @@ -72,8 +72,8 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s, { const struct vu_dev *vdev = c->vdev; int iov_cnt, idx, iov_used; + size_t off, hdrlen, l2len; struct msghdr msg = { 0 }; - size_t off, hdrlen;
ASSERT(!c->no_udp);
@@ -116,6 +116,15 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s, iov_vu[idx].iov_len = off; iov_used = idx + !!off;
+ /* pad 802.3 frame to 60 bytes if needed */ + l2len = *dlen + hdrlen - sizeof(struct virtio_net_hdr_mrg_rxbuf); + if (l2len < ETH_ZLEN) { + size_t pad = ETH_ZLEN - l2len; + + iov_vu[idx].iov_len += pad; + memset(&iov_vu[idx].iov_base + off, 0, pad); + } + vu_set_vnethdr(vdev, iov_vu[0].iov_base, iov_used);
/* release unused buffers */ -- 2.43.0
-- Stefano
On Fri, Dec 05, 2025 at 01:51:43AM +0100, Stefano Brivio wrote:
On Wed, 5 Nov 2025 14:49:59 +1100 David Gibson wrote:
On Mon, Nov 03, 2025 at 11:16:12AM +0100, Stefano Brivio wrote:
For both TCP and UDP, we request vhost-user buffers that are large enough to reach ETH_ZLEN (60 bytes), so padding is just a matter of increasing the appropriate iov_len and clearing bytes in the buffer as needed.
Link: https://bugs.passt.top/show_bug.cgi?id=166 Signed-off-by: Stefano Brivio
I think this is correct, apart from the nasty bug Laurent spotted.
I'm less certain if this is the most natural way to do it.
--- tcp.c | 2 -- tcp_internal.h | 1 + tcp_vu.c | 27 +++++++++++++++++++++++++++ udp_vu.c | 11 ++++++++++- 4 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/tcp.c b/tcp.c index e91c0cf..039688d 100644 --- a/tcp.c +++ b/tcp.c @@ -335,8 +335,6 @@ enum { }; #endif
-/* MSS rounding: see SET_MSS() */ -#define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ diff --git a/tcp_internal.h b/tcp_internal.h index 5f8fb35..d2295c9 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -12,6 +12,7 @@ #define BUF_DISCARD_SIZE (1 << 20) #define DISCARD_IOV_NUM DIV_ROUND_UP(MAX_WINDOW, BUF_DISCARD_SIZE)
+#define MSS_DEFAULT /* and minimum */ 536 /* as it comes from minimum MTU */ #define MSS4 ROUND_DOWN(IP_MAX_MTU - \ sizeof(struct tcphdr) - \ sizeof(struct iphdr), \ diff --git a/tcp_vu.c b/tcp_vu.c index 1c81ce3..7239401 100644 --- a/tcp_vu.c +++ b/tcp_vu.c @@ -60,6 +60,29 @@ static size_t tcp_vu_hdrlen(bool v6) return hdrlen; }
+/** + * tcp_vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed + * @iov: iovec array storing 802.3 frame with TCP segment inside + * @cnt: Number of entries in @iov + */ +static void tcp_vu_pad(struct iovec *iov, size_t cnt) +{ + size_t l2len, pad; + + ASSERT(iov_size(iov, cnt) >= sizeof(struct virtio_net_hdr_mrg_rxbuf)); + l2len = iov_size(iov, cnt) - sizeof(struct virtio_net_hdr_mrg_rxbuf);
Re-obtaining l2len from iov_size() seems kind of awkward, since the callers should already know the length - they've just used it to populate iov_len.
That's only the case for tcp_vu_send_flag() though, because tcp_vu_data_from_sock() can use split buffers and iov_len of the first element is not the same as the whole frame length.
Yes, but..
That is, you could (very much in theory) have iov_len set to 50 for the first iov item, set to 4 for the second iov item, and the frame needs padding, but you can't tell from the first iov item itself.
Before we call tcp_vu_prepare() on that path, we've already calculated 'dlen' from iov_size, here we're calling iov_size a second time.
+ if (l2len >= ETH_ZLEN) + return;
+ + pad = ETH_ZLEN - l2len; + + /* tcp_vu_sock_recv() requests at least MSS-sized vhost-user buffers */ + static_assert(ETH_ZLEN <= MSS_DEFAULT);
So, this is true for the data path, but not AFAICT for the flags path.
There _is_ still enough space in this case, because we request space for (tcp_vu_hdrlen() + sizeof(struct tcp_syn_opts)) which works out to:

  ETH_HLEN          14
  + IP header       20
  + TCP header      20
  + tcp_syn_opts     8
  ----
                    62 > ETH_ZLEN
But the comment and assert are misleading.
Dropped, in favour of:
It seems like it would make more sense to clamp ETH_ZLEN as a lower length bound before we vu_collect() the buffers.
this.
👍
Or indeed, like we should be calculating l2len already including the clamping.
That's not trivial to do for the data path, I think (see above). I think it would be doable with a rework of the tcp_vu_data_from_sock() loop but I'd say it's beyond the scope of this series.
Yeah, fair enough.
+ memset(&iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad); + iov[cnt - 1].iov_len += pad; +} + /** * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload) * @c: Execution context @@ -138,6 +161,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) tcp_fill_headers(c, conn, NULL, eh, ip4h, ip6h, th, &payload, NULL, seq, !*c->pcap);
+ tcp_vu_pad(&flags_elem[0].in_sg[0], 1); + if (*c->pcap) { pcap_iov(&flags_elem[0].in_sg[0], 1, sizeof(struct virtio_net_hdr_mrg_rxbuf)); @@ -456,6 +481,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
+ tcp_vu_pad(iov, buf_cnt); + if (*c->pcap) { pcap_iov(iov, buf_cnt, sizeof(struct virtio_net_hdr_mrg_rxbuf)); diff --git a/udp_vu.c b/udp_vu.c index 099677f..1b60860 100644 --- a/udp_vu.c +++ b/udp_vu.c @@ -72,8 +72,8 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s, { const struct vu_dev *vdev = c->vdev; int iov_cnt, idx, iov_used; + size_t off, hdrlen, l2len; struct msghdr msg = { 0 }; - size_t off, hdrlen;
ASSERT(!c->no_udp);
@@ -116,6 +116,15 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s, iov_vu[idx].iov_len = off; iov_used = idx + !!off;
+ /* pad 802.3 frame to 60 bytes if needed */ + l2len = *dlen + hdrlen - sizeof(struct virtio_net_hdr_mrg_rxbuf); + if (l2len < ETH_ZLEN) { + size_t pad = ETH_ZLEN - l2len; + + iov_vu[idx].iov_len += pad; + memset(&iov_vu[idx].iov_base + off, 0, pad); + } + vu_set_vnethdr(vdev, iov_vu[0].iov_base, iov_used);
/* release unused buffers */ -- 2.43.0
-- Stefano
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson