[PATCH v2 0/5] Pad all inbound frames to 802.3 minimum size if needed
Patch 1/5 handles the easy, non-batched case with a copy to a padded
buffer (only if needed). Patches 2/5 and 3/5 clean up coding style
before further changes. Patch 4/5 deals with TCP and UDP batched
frames in non-vhost-user modes. Patch 5/5 covers batched frames in
vhost-user mode instead.

v2:

* in 1/5, calculate vnet_len *after* padding

* in 5/5:
  - pass the right pointers to memset()
  - explicitly request at least ETH_ZLEN bytes from vu_collect()

Stefano Brivio (5):
  tap: Pad non-batched frames to 802.3 minimum (60 bytes) if needed
  tcp: Fix coding style for comment to enum tcp_iov_parts
  udp: Fix coding style for comment to enum udp_iov_idx
  tcp, udp: Pad batched frames to 60 bytes (802.3 minimum) in
    non-vhost-user modes
  tcp, udp: Pad batched frames for vhost-user modes to 60 bytes
    (802.3 minimum)

 tap.c          | 11 ++++++++++-
 tcp.c          |  2 --
 tcp_buf.c      | 23 +++++++++++++++++++++++
 tcp_internal.h |  5 ++++-
 tcp_vu.c       | 32 ++++++++++++++++++++++++++++----
 udp.c          | 31 ++++++++++++++++++++++++++-----
 udp_vu.c       | 13 ++++++++++++-
 util.c         |  3 +++
 util.h         |  3 +++
 9 files changed, 109 insertions(+), 14 deletions(-)

--
2.43.0
IEEE 802.3 requires a minimum frame payload of 46 bytes, without an
802.1Q tag. Add padding for the simple tap_send_single() case using
a zero-filled 60-byte buffer and copying data to it if needed.
In theory, we could add a further element in the iovec array, say:
	uint8_t padding[ETH_ZLEN] = { 0 };
	struct iovec iov[3];

	...

	if (l2len < ETH_ZLEN) {
		iov[iovcnt].iov_base = (void *)padding;
		iov[iovcnt].iov_len = ETH_ZLEN - l2len;
		iovcnt++;
	}
and avoid a copy, but that would substantially complicate the
vhost-user path, and it's questionable whether passing a reference
to a further buffer actually causes lower overhead than the simple
copy.
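
For reference, a minimal sketch of the copy-based approach this patch
takes instead (pad_frame() and its signature are illustrative, not
the actual passt code; the real change lives in tap_send_single(),
using ETH_ZLEN from <linux/if_ether.h>):

	#include <string.h>
	#include <linux/if_ether.h>	/* ETH_ZLEN, 60 bytes */

	/* Return a frame of at least ETH_ZLEN bytes, copying short
	 * frames into the zero-filled @padded buffer
	 */
	static const void *pad_frame(const void *data, size_t *l2len,
				     uint8_t padded[ETH_ZLEN])
	{
		if (*l2len >= ETH_ZLEN)
			return data;		/* no padding needed */

		memset(padded, 0, ETH_ZLEN);
		memcpy(padded, data, *l2len);
		*l2len = ETH_ZLEN;

		return padded;
	}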
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
...as I'm going to add a new value to it.
Fixes: 95601237ef82 ("tcp: Replace TCP buffer structure by an iovec array")
Signed-off-by: Stefano Brivio
...as I'm going to add a new value to it.
Fixes: 3f9bd867b585 ("udp: Split tap-bound UDP packets into multiple buffers using io vector")
Signed-off-by: Stefano Brivio
Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and
UDP_IOV_ETH_PAD for UDP, after the payload. Make it point to a
zero-filled buffer, and send out part of it as needed to reach the
minimum frame length given by 802.3, that is, 60 bytes altogether.
The frames we might need to pad are IPv4 only (the IPv6 header is
larger), and are typically TCP ACK segments but can also be small
data segments or datagrams.
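
As a sketch of how the new part might be filled in (the enum values
are from this patch and the eth_pad[] name follows the series, but
the surrounding variables are illustrative):

	static const char eth_pad[ETH_ZLEN] = { 0 };

	/* After filling headers and payload, with l2len being the
	 * frame length so far:
	 */
	iov[TCP_IOV_ETH_PAD].iov_base = (void *)eth_pad;
	if (l2len < ETH_ZLEN)
		iov[TCP_IOV_ETH_PAD].iov_len = ETH_ZLEN - l2len;
	else
		iov[TCP_IOV_ETH_PAD].iov_len = 0;	/* nothing to pad */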
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
For both TCP and UDP, we request vhost-user buffers that are large
enough to reach ETH_ZLEN (60 bytes), so padding is just a matter of
increasing the appropriate iov_len and clearing bytes in the buffer
as needed.
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
On Fri, Dec 05, 2025 at 01:51:53AM +0100, Stefano Brivio wrote:
IEEE 802.3 requires a minimum frame payload of 46 bytes, without an
802.1Q tag. Add padding for the simple tap_send_single() case using
a zero-filled 60-byte buffer and copying data to it if needed.

In theory, we could add a further element in the iovec array, say:

	uint8_t padding[ETH_ZLEN] = { 0 };
	struct iovec iov[3];

	...

	if (l2len < ETH_ZLEN) {
		iov[iovcnt].iov_base = (void *)padding;
		iov[iovcnt].iov_len = ETH_ZLEN - l2len;
		iovcnt++;
	}

and avoid a copy, but that would substantially complicate the
vhost-user path, and it's questionable whether passing a reference
to a further buffer actually causes lower overhead than the simple
copy.

Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
Reviewed-by: Laurent Vivier
Reviewed-by: David Gibson
---
 tap.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/tap.c b/tap.c
index 44b0644..e3ea61c 100644
--- a/tap.c
+++ b/tap.c
@@ -130,9 +130,18 @@ unsigned long tap_l2_max_len(const struct ctx *c)
  */
 void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
 {
-	uint32_t vnet_len = htonl(l2len);
+	uint8_t padded[ETH_ZLEN] = { 0 };
 	struct iovec iov[2];
 	size_t iovcnt = 0;
+	uint32_t vnet_len;
+
+	if (l2len < ETH_ZLEN) {
+		memcpy(padded, data, l2len);
+		data = padded;
+		l2len = ETH_ZLEN;
+	}
+
+	vnet_len = htonl(l2len);
 	switch (c->mode) {
 	case MODE_PASST:

--
2.43.0
--
David Gibson (he or they)		| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au		| minimalist, thank you, not the other way
					| around.
http://www.ozlabs.org/~dgibson
On Fri, Dec 05, 2025 at 01:51:57AM +0100, Stefano Brivio wrote:
For both TCP and UDP, we request vhost-user buffers that are large enough to reach ETH_ZLEN (60 bytes), so padding is just a matter of increasing the appropriate iov_len and clearing bytes in the buffer as needed.
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
---
 tcp.c          |  2 --
 tcp_internal.h |  1 +
 tcp_vu.c       | 32 ++++++++++++++++++++++++++++----
 udp_vu.c       | 13 ++++++++++++-
 4 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/tcp.c b/tcp.c
index c5486bc..8cd062f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -341,8 +341,6 @@ enum {
 };
 #endif
-/* MSS rounding: see SET_MSS() */
-#define MSS_DEFAULT 536
 #define WINDOW_DEFAULT 14600		/* RFC 6928 */
 #define RTO_INIT 1			/* s, RFC 6298 */

diff --git a/tcp_internal.h b/tcp_internal.h
index 5f8fb35..d2295c9 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -12,6 +12,7 @@
 #define BUF_DISCARD_SIZE	(1 << 20)
 #define DISCARD_IOV_NUM	DIV_ROUND_UP(MAX_WINDOW, BUF_DISCARD_SIZE)
+#define MSS_DEFAULT /* and minimum */ 536 /* as it comes from minimum MTU */
 #define MSS4	ROUND_DOWN(IP_MAX_MTU -			\
 			   sizeof(struct tcphdr) -	\
 			   sizeof(struct iphdr),	\

diff --git a/tcp_vu.c b/tcp_vu.c
index 1c81ce3..638813c 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -60,6 +60,26 @@ static size_t tcp_vu_hdrlen(bool v6)
 	return hdrlen;
 }
+/**
+ * tcp_vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed
+ * @iov:	iovec array storing 802.3 frame with TCP segment inside
+ * @cnt:	Number of entries in @iov
+ */
+static void tcp_vu_pad(struct iovec *iov, size_t cnt)
+{
+	size_t l2len, pad;
+
+	ASSERT(iov_size(iov, cnt) >= sizeof(struct virtio_net_hdr_mrg_rxbuf));
+	l2len = iov_size(iov, cnt) - sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	if (l2len >= ETH_ZLEN)
+		return;
+
+	pad = ETH_ZLEN - l2len;
+
+	memset((char *)iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);
+	iov[cnt - 1].iov_len += pad;
This assumes there's enough allocated space in the last buffer to
extend it this way. I'm pretty sure that's true in practice, but it's
not super obvious from right here.

I can't see a way to avoid that without doing a substantial rework of
both paths. But we should at least document the assumption in a
comment.

Also, there's not anything really TCP specific about this, so we
should be able to re-use it for UDP, no?
+}
+
 /**
  * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
  * @c:		Execution context
@@ -91,12 +111,11 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)

 	vu_set_element(&flags_elem[0], NULL, &flags_iov[0]);
 	elem_cnt = vu_collect(vdev, vq, &flags_elem[0], 1,
-			      hdrlen + sizeof(struct tcp_syn_opts), NULL);
+			      MAX(hdrlen + sizeof(*opts), ETH_ZLEN), NULL);
 	if (elem_cnt != 1)
 		return -1;
-	ASSERT(flags_elem[0].in_sg[0].iov_len >=
-	       hdrlen + sizeof(struct tcp_syn_opts));
+	ASSERT(flags_elem[0].in_sg[0].iov_len >= hdrlen + sizeof(*opts));
vu_set_vnethdr(vdev, flags_elem[0].in_sg[0].iov_base, 1);
@@ -138,6 +157,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
A few lines above here is where we truncate iov_len[] to match the actual length of the frame. Replace that with a call to the pad/truncate function.
tcp_fill_headers(c, conn, NULL, eh, ip4h, ip6h, th, &payload, NULL, seq, !*c->pcap);
+	tcp_vu_pad(&flags_elem[0].in_sg[0], 1);
+
 	if (*c->pcap) {
 		pcap_iov(&flags_elem[0].in_sg[0], 1,
 			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
@@ -211,7 +232,8 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
 		cnt = vu_collect(vdev, vq, &elem[elem_cnt],
 				 VIRTQUEUE_MAX_SIZE - elem_cnt,
-				 MIN(mss, fillsize) + hdrlen, &frame_size);
+				 MAX(MIN(mss, fillsize) + hdrlen, ETH_ZLEN),
+				 &frame_size);
 		if (cnt == 0)
 			break;
@@ -456,6 +478,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
+	tcp_vu_pad(iov, buf_cnt);
+
 	if (*c->pcap) {
 		pcap_iov(iov, buf_cnt,
 			 sizeof(struct virtio_net_hdr_mrg_rxbuf));

diff --git a/udp_vu.c b/udp_vu.c
index 099677f..33dbb9a 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -72,8 +72,8 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s,
 {
 	const struct vu_dev *vdev = c->vdev;
 	int iov_cnt, idx, iov_used;
+	size_t off, hdrlen, l2len;
 	struct msghdr msg = { 0 };
-	size_t off, hdrlen;
ASSERT(!c->no_udp);
@@ -116,6 +116,17 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s,
 	iov_vu[idx].iov_len = off;
 	iov_used = idx + !!off;
+	/* pad 802.3 frame to 60 bytes if needed */
+	l2len = *dlen + hdrlen - sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	if (l2len < ETH_ZLEN) {
+		size_t pad = ETH_ZLEN - l2len;
+
+		memset((char *)iov_vu[idx].iov_base + iov_vu[idx].iov_len,
+		       0, pad);
+
+		iov_vu[idx].iov_len += pad;
+	}
+
 	vu_set_vnethdr(vdev, iov_vu[0].iov_base, iov_used);
 	/* release unused buffers */

--
2.43.0
--
David Gibson (he or they)		| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au		| minimalist, thank you, not the other way
					| around.
http://www.ozlabs.org/~dgibson
On Fri, 5 Dec 2025 01:51:56 +0100
Stefano Brivio wrote:
Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and UDP_IOV_ETH_PAD for UDP, after the payload, make that point to a zero-filled buffer, and send out a part of it if needed to reach the minimum frame length given by 802.3, that is, 60 bytes altogether.
The frames we might need to pad are IPv4 only (the IPv6 header is larger), and are typically TCP ACK segments but can also be small data segments or datagrams.
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
Reviewed-by: Laurent Vivier
...for some reason, in combination with the previous series with TCP
throughput fixes, this patch now seems to break basic transfers
("large transfer", IPv4, guest to host), with passt only.

I'm fairly sure it didn't cause failures when I ran tests on this
series alone, at least twice, once now and once for the RFC version,
but it might also be that I missed running tests (in isolation) for
some reason.

No idea where the issue is, yet...

--
Stefano
On Fri, 5 Dec 2025 06:48:42 +0100
Stefano Brivio wrote:
On Fri, 5 Dec 2025 01:51:56 +0100 Stefano Brivio wrote:

Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and UDP_IOV_ETH_PAD for UDP, after the payload, make that point to a zero-filled buffer, and send out a part of it if needed to reach the minimum frame length given by 802.3, that is, 60 bytes altogether.
The frames we might need to pad are IPv4 only (the IPv6 header is larger), and are typically TCP ACK segments but can also be small data segments or datagrams.
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
Reviewed-by: Laurent Vivier

...for some reason, in combination with the previous series with TCP
throughput fixes, this patch now seems to break basic transfers
("large transfer", IPv4, guest to host), with passt only.
Whoa, it turns out I didn't test this one at all, sorry for that, I
don't know how it happened. This "clearly" needs:

---
diff --git a/tcp.c b/tcp.c
index c5486bc..7140b22 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1037,7 +1037,7 @@ void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
 	else
 		tcp_update_csum(psum, th, payload);

-	tap_hdr_update(taph, l3len + sizeof(struct ethhdr));
+	tap_hdr_update(taph, MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN));
 }

 /**
diff --git a/udp.c b/udp.c
index f32f553..08bec50 100644
--- a/udp.c
+++ b/udp.c
@@ -381,19 +381,25 @@ static void udp_tap_prepare(const struct mmsghdr *mmh,
 	struct ethhdr *eh = (*tap_iov)[UDP_IOV_ETH].iov_base;
 	struct udp_payload_t *bp = &udp_payload[idx];
 	struct udp_meta_t *bm = &udp_meta[idx];
-	size_t l4len;
+	size_t l4len, l2len;

 	eth_update_mac(eh, NULL, tap_omac);

 	if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->oaddr)) {
 		l4len = udp_update_hdr6(&bm->ip6h, bp, toside,
 					mmh[idx].msg_len, no_udp_csum);
-		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) + ETH_HLEN);
+
+		l2len = MAX(l4len + sizeof(bm->ip6h) + ETH_HLEN, ETH_ZLEN);
+		tap_hdr_update(&bm->taph, l2len);
+
 		eh->h_proto = htons_constant(ETH_P_IPV6);
 		(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h);
 	} else {
 		l4len = udp_update_hdr4(&bm->ip4h, bp, toside,
 					mmh[idx].msg_len, no_udp_csum);
-		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) + ETH_HLEN);
+
+		l2len = MAX(l4len + sizeof(bm->ip4h) + ETH_HLEN, ETH_ZLEN);
+		tap_hdr_update(&bm->taph, l2len);
+
 		eh->h_proto = htons_constant(ETH_P_IP);
 		(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip4h);
 	}
---

on top. I'll respin the whole series in a bit (once I addressed
David's comments to 5/5 as well).

--
Stefano
On Fri, 5 Dec 2025 15:07:27 +1100
David Gibson wrote:
On Fri, Dec 05, 2025 at 01:51:57AM +0100, Stefano Brivio wrote:
For both TCP and UDP, we request vhost-user buffers that are large enough to reach ETH_ZLEN (60 bytes), so padding is just a matter of increasing the appropriate iov_len and clearing bytes in the buffer as needed.
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
---
 tcp.c          |  2 --
 tcp_internal.h |  1 +
 tcp_vu.c       | 32 ++++++++++++++++++++++++++++----
 udp_vu.c       | 13 ++++++++++++-
 4 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/tcp.c b/tcp.c
index c5486bc..8cd062f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -341,8 +341,6 @@ enum {
 };
 #endif
-/* MSS rounding: see SET_MSS() */
-#define MSS_DEFAULT 536
 #define WINDOW_DEFAULT 14600		/* RFC 6928 */
 #define RTO_INIT 1			/* s, RFC 6298 */

diff --git a/tcp_internal.h b/tcp_internal.h
index 5f8fb35..d2295c9 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -12,6 +12,7 @@
 #define BUF_DISCARD_SIZE	(1 << 20)
 #define DISCARD_IOV_NUM	DIV_ROUND_UP(MAX_WINDOW, BUF_DISCARD_SIZE)
+#define MSS_DEFAULT /* and minimum */ 536 /* as it comes from minimum MTU */
 #define MSS4	ROUND_DOWN(IP_MAX_MTU -			\
 			   sizeof(struct tcphdr) -	\
 			   sizeof(struct iphdr),	\

diff --git a/tcp_vu.c b/tcp_vu.c
index 1c81ce3..638813c 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -60,6 +60,26 @@ static size_t tcp_vu_hdrlen(bool v6)
 	return hdrlen;
 }
+/**
+ * tcp_vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed
+ * @iov:	iovec array storing 802.3 frame with TCP segment inside
+ * @cnt:	Number of entries in @iov
+ */
+static void tcp_vu_pad(struct iovec *iov, size_t cnt)
+{
+	size_t l2len, pad;
+
+	ASSERT(iov_size(iov, cnt) >= sizeof(struct virtio_net_hdr_mrg_rxbuf));
+	l2len = iov_size(iov, cnt) - sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	if (l2len >= ETH_ZLEN)
+		return;
+
+	pad = ETH_ZLEN - l2len;
+
+	memset((char *)iov[cnt - 1].iov_base + iov[cnt - 1].iov_len, 0, pad);
+	iov[cnt - 1].iov_len += pad;
This assumes there's enough allocated space in the last buffer to extend it this way. I'm pretty sure that's true in practice, but it's not super obvious from right here.
I can't see a way to avoid that without doing a substantial rework of both paths. But we should at least document the assumption in a comment.
Oops, that's actually not guaranteed. The only assumption we have is
that the first buffer is at least 54 bytes long for IPv4:

	/* we guess the first iovec provided by the guest can embed
	 * all the headers needed by L2 frame
	 */
	ASSERT(iov[0].iov_len >= hdrlen);

because, as far as I understand, there's no known vhost-user
implementation that would give us less than 1k or so, and we request
at least 60 bytes from vu_collect(). But that might lead, for
example, to a situation with:

	iov[n].iov_len = < 54, 1, 5 >

and we would write 6 bytes to the last element.

I could have switched to something like an iov_from_buf() call from
the 60-byte eth_pad[] I added in 4/5, but there's no version
adjusting iov_len, and we probably wouldn't need it for anything
else.

So I changed approach altogether: I'll just ASSERT() that we have at
least 60 bytes in the first buffer. It's just 6 bytes on top in the
worst case, and still much less than any buffer we might actually see
in practice.
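
A sketch of what that reworked helper (renamed to vu_pad() below)
could look like under this assumption, that is, the first buffer
holding at least ETH_ZLEN frame bytes on top of the vnet header, so
that any frame shorter than ETH_ZLEN ends within iov[0]. This body is
illustrative, not the respun patch:

	static void vu_pad(struct iovec *iov, size_t cnt)
	{
		size_t hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
		size_t l2len, pad;

		ASSERT(iov_size(iov, cnt) >= hdrlen);
		l2len = iov_size(iov, cnt) - hdrlen;
		if (l2len >= ETH_ZLEN)
			return;

		/* Short frame: given the 60-byte guarantee checked at
		 * vu_collect() time, it lies entirely within iov[0],
		 * which we can safely extend in place
		 */
		pad = ETH_ZLEN - l2len;
		memset((char *)iov[0].iov_base + iov[0].iov_len, 0, pad);
		iov[0].iov_len += pad;
	}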
Also, there's not anything really TCP specific about this, so we should be able to re-use it for UDP, no?
Renamed to vu_pad() and moved to vu_common.c.
+}
+
 /**
  * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
  * @c:		Execution context
@@ -91,12 +111,11 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)

 	vu_set_element(&flags_elem[0], NULL, &flags_iov[0]);
 	elem_cnt = vu_collect(vdev, vq, &flags_elem[0], 1,
-			      hdrlen + sizeof(struct tcp_syn_opts), NULL);
+			      MAX(hdrlen + sizeof(*opts), ETH_ZLEN), NULL);
 	if (elem_cnt != 1)
 		return -1;
-	ASSERT(flags_elem[0].in_sg[0].iov_len >=
-	       hdrlen + sizeof(struct tcp_syn_opts));
+	ASSERT(flags_elem[0].in_sg[0].iov_len >= hdrlen + sizeof(*opts));
vu_set_vnethdr(vdev, flags_elem[0].in_sg[0].iov_base, 1);
@@ -138,6 +157,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
A few lines above here is where we truncate iov_len[] to match the actual length of the frame. Replace that with a call to the pad/truncate function.
Ah, no need anymore, as the first buffer is now the only buffer that might ever need padding.
tcp_fill_headers(c, conn, NULL, eh, ip4h, ip6h, th, &payload, NULL, seq, !*c->pcap);
+	tcp_vu_pad(&flags_elem[0].in_sg[0], 1);
+
 	if (*c->pcap) {
 		pcap_iov(&flags_elem[0].in_sg[0], 1,
 			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
@@ -211,7 +232,8 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
 		cnt = vu_collect(vdev, vq, &elem[elem_cnt],
 				 VIRTQUEUE_MAX_SIZE - elem_cnt,
-				 MIN(mss, fillsize) + hdrlen, &frame_size);
+				 MAX(MIN(mss, fillsize) + hdrlen, ETH_ZLEN),
+				 &frame_size);
 		if (cnt == 0)
 			break;
@@ -456,6 +478,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
+	tcp_vu_pad(iov, buf_cnt);
+
 	if (*c->pcap) {
 		pcap_iov(iov, buf_cnt,
 			 sizeof(struct virtio_net_hdr_mrg_rxbuf));

diff --git a/udp_vu.c b/udp_vu.c
index 099677f..33dbb9a 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -72,8 +72,8 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s,
 {
 	const struct vu_dev *vdev = c->vdev;
 	int iov_cnt, idx, iov_used;
+	size_t off, hdrlen, l2len;
 	struct msghdr msg = { 0 };
-	size_t off, hdrlen;
ASSERT(!c->no_udp);
@@ -116,6 +116,17 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s,
 	iov_vu[idx].iov_len = off;
 	iov_used = idx + !!off;
+	/* pad 802.3 frame to 60 bytes if needed */
+	l2len = *dlen + hdrlen - sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	if (l2len < ETH_ZLEN) {
+		size_t pad = ETH_ZLEN - l2len;
+
+		memset((char *)iov_vu[idx].iov_base + iov_vu[idx].iov_len,
+		       0, pad);
+
+		iov_vu[idx].iov_len += pad;
+	}
+
 	vu_set_vnethdr(vdev, iov_vu[0].iov_base, iov_used);
 	/* release unused buffers */

--
2.43.0
--
Stefano