[PATCH v3 0/5] Pad all inbound frames to 802.3 minimum size if needed
Patch 1/5 handles the easy non-batched case with a copy to a padded
buffer (only if needed). Patches 2/5 and 3/5 clean up coding style
before further changes. Patch 4/5 deals with TCP and UDP batched
frames in non-vhost-user modes. Patch 5/5 is for batched frames in
vhost-user mode instead.

v3:
* in 4/5, add calls to tap_hdr_update() from UDP and TCP code to
  actually update the frame length descriptors after padding
* refactor 5/5 entirely based on David's feedback: assume that the
  first buffer supplied by the guest is at least 60 bytes long, and
  introduce a vu_pad() common function for both TCP and UDP. Don't
  recalculate l2len in it, as all the callers already do.

v2:
* in 1/5, calculate vnet_len *after* padding
* in 5/5:
  - pass the right pointers to memset()
  - explicitly request at least ETH_ZLEN bytes from vu_collect()

Stefano Brivio (5):
  tap: Pad non-batched frames to 802.3 minimum (60 bytes) if needed
  tcp: Fix coding style for comment to enum tcp_iov_parts
  udp: Fix coding style for comment to enum udp_iov_idx
  tcp, udp: Pad batched frames to 60 bytes (802.3 minimum) in
    non-vhost-user modes
  tcp, udp: Pad batched frames for vhost-user modes to 60 bytes (802.3
    minimum)

 tap.c          | 11 ++++++++++-
 tcp.c          |  2 +-
 tcp_buf.c      | 23 +++++++++++++++++++++++
 tcp_internal.h |  4 +++-
 tcp_vu.c       | 17 ++++++++++++-----
 udp.c          | 43 +++++++++++++++++++++++++++++++++++--------
 udp_vu.c       |  8 ++++++--
 util.c         |  3 +++
 util.h         |  3 +++
 vu_common.c    | 14 ++++++++++++++
 vu_common.h    |  1 +
 11 files changed, 111 insertions(+), 18 deletions(-)

-- 
2.43.0
IEEE 802.3 requires a minimum frame payload of 46 bytes for frames
without an 802.1Q tag, that is, 60 bytes total (ETH_ZLEN) excluding
the FCS. Add padding for the simple tap_send_single() case using
a zero-filled 60-byte buffer and copying data to it if needed.
In theory, we could add a further element in the iovec array, say:
	uint8_t padding[ETH_ZLEN] = { 0 };
	struct iovec iov[3];

	...

	if (l2len < ETH_ZLEN) {
		iov[iovcnt].iov_base = (void *)padding;
		iov[iovcnt].iov_len = ETH_ZLEN - l2len;
		iovcnt++;
	}
and avoid a copy, but that would substantially complicate the
vhost-user path, and it's questionable whether passing a reference
to a further buffer actually causes lower overhead than the simple
copy.
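For illustration only, the copy approach actually taken can be
sketched as follows; 'buf' and 'l2len' are hypothetical names for the
frame pointer and layer-2 length handled by tap_send_single(), whose
other details are omitted:

	static uint8_t pad_buf[ETH_ZLEN];

	if (l2len < ETH_ZLEN) {
		memcpy(pad_buf, buf, l2len);
		/* zero the tail: the buffer might hold leftovers of a
		 * previous, longer frame
		 */
		memset(pad_buf + l2len, 0, ETH_ZLEN - l2len);
		buf = pad_buf;
		l2len = ETH_ZLEN;	/* vnet_len is calculated after padding (v2 note) */
	}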
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
...as I'm going to add a new value to it.
Fixes: 95601237ef82 ("tcp: Replace TCP buffer structure by an iovec array")
Signed-off-by: Stefano Brivio
...as I'm going to add a new value to it.
Fixes: 3f9bd867b585 ("udp: Split tap-bound UDP packets into multiple buffers using io vector")
Signed-off-by: Stefano Brivio
Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and
UDP_IOV_ETH_PAD for UDP, after the payload; make it point to a
zero-filled buffer, and send out part of it as needed to reach the
minimum frame length given by 802.3, that is, 60 bytes altogether.
The frames we might need to pad are IPv4 only: with IPv6, even a bare
TCP ACK segment takes 14 + 40 + 20 = 74 bytes, while its IPv4
counterpart is just 54. They are typically TCP ACK segments, but can
also be small data segments or datagrams.
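Conceptually, the length of the new padding part is derived from the
other iovec entries right before sending, as in the tcp_l2_buf_pad()
and udp_tap_pad() helpers added by this patch (TCP naming shown,
condensed):

	size_t l2len = iov[TCP_IOV_ETH].iov_len + iov[TCP_IOV_IP].iov_len +
		       iov[TCP_IOV_PAYLOAD].iov_len;

	iov[TCP_IOV_ETH_PAD].iov_len = l2len < ETH_ZLEN ? ETH_ZLEN - l2len : 0;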
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
For both TCP and UDP, instead of just expecting the first provided
buffer to be large enough to contain the headers we need (from 42
bytes for UDP data over IPv4 to 82 bytes for TCP with options over
IPv6), strengthen that assumption: make sure that buffers are at
least ETH_ZLEN (60 bytes) long.
This looks reasonable because there are no known vhost-user
hypervisor implementations that would give us smaller buffers than
that, and we would anyway hit an assertion failure with IPv6 if we
ever had less than 60 bytes per buffer.
At this point, all we need to do is to pad the first (and only)
buffer, should data and headers use less than that.
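The padding itself is implemented by the new vu_pad() helper (full
diff in the review below), which zero-fills the tail of the first
buffer and extends its length accordingly:

	void vu_pad(struct iovec *iov, size_t l2len)
	{
		if (l2len >= ETH_ZLEN)
			return;

		memset((char *)iov->iov_base + iov->iov_len, 0,
		       ETH_ZLEN - l2len);
		iov->iov_len += ETH_ZLEN - l2len;
	}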
Link: https://bugs.passt.top/show_bug.cgi?id=166
Signed-off-by: Stefano Brivio
On Sat, Dec 06, 2025 at 02:26:57AM +0100, Stefano Brivio wrote:
> For both TCP and UDP, instead of just expecting the first provided
> buffer to be large enough to contain the headers we need (from 42
> bytes for UDP data over IPv4 to 82 bytes for TCP with options over
> IPv6), strengthen that assumption: make sure that buffers are at
> least ETH_ZLEN (60 bytes) long.
>
> This looks reasonable because there are no known vhost-user
> hypervisor implementations that would give us smaller buffers than
> that, and we would anyway hit an assertion failure with IPv6 if we
> ever had less than 60 bytes per buffer.
>
> At this point, all we need to do is to pad the first (and only)
> buffer, should data and headers use less than that.
>
> Link: https://bugs.passt.top/show_bug.cgi?id=166
> Signed-off-by: Stefano Brivio
I find the way iov_len changes through the whole path a bit confusing:
1. First, it represents buffer length we get from vu_collect
2. We check that it has room for a padded frame
3. We decrease it based on the _unpadded_ frame
4. We write the padding to the buffer
5. We increase it again to include the padding
The last step is safe because of the earlier check. I think it's now
adequately clear that it's safe, but it's not _super_ clear because
there are still some in between steps. I'd kind of prefer to not
shrink iov_len below the value for the padded frame size in the first
place.
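In code terms (simplified from the udp_vu.c hunks below, with some
arguments elided), that sequence is roughly:

	/* (1) */ iov_cnt = vu_collect(/* ... */);	/* iov_len: buffer size */
	/* (2) */ ASSERT(iov_vu[0].iov_len >= MAX(hdrlen, ETH_ZLEN));
	/* (3) */ iov_vu[idx].iov_len = off;		/* shrink to unpadded frame */
	/* (4), (5) */ vu_pad(&iov_vu[0], l2len);	/* zero tail, grow iov_len back */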
But, this does the job, and I don't really want to hold this up any
longer so,
Reviewed-by: David Gibson
---
 tcp_vu.c    | 17 ++++++++++++-----
 udp_vu.c    |  8 ++++++--
 vu_common.c | 14 ++++++++++++++
 vu_common.h |  1 +
 4 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/tcp_vu.c b/tcp_vu.c
index 1c81ce3..db9db78 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -91,12 +91,12 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	vu_set_element(&flags_elem[0], NULL, &flags_iov[0]);

 	elem_cnt = vu_collect(vdev, vq, &flags_elem[0], 1,
-			      hdrlen + sizeof(struct tcp_syn_opts), NULL);
+			      MAX(hdrlen + sizeof(*opts), ETH_ZLEN), NULL);
 	if (elem_cnt != 1)
 		return -1;

 	ASSERT(flags_elem[0].in_sg[0].iov_len >=
-	       hdrlen + sizeof(struct tcp_syn_opts));
+	       MAX(hdrlen + sizeof(*opts), ETH_ZLEN));

 	vu_set_vnethdr(vdev, flags_elem[0].in_sg[0].iov_base, 1);

@@ -138,6 +138,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	tcp_fill_headers(c, conn, NULL, eh, ip4h, ip6h, th, &payload,
 			 NULL, seq, !*c->pcap);

+	vu_pad(&flags_elem[0].in_sg[0], hdrlen + optlen);
+
 	if (*c->pcap) {
 		pcap_iov(&flags_elem[0].in_sg[0], 1,
 			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
@@ -211,7 +213,8 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
 		cnt = vu_collect(vdev, vq, &elem[elem_cnt],
 				 VIRTQUEUE_MAX_SIZE - elem_cnt,
-				 MIN(mss, fillsize) + hdrlen, &frame_size);
+				 MAX(MIN(mss, fillsize) + hdrlen, ETH_ZLEN),
+				 &frame_size);
 		if (cnt == 0)
 			break;

@@ -254,6 +257,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
 			len -= iov->iov_len;
 		}
+		/* adjust head count */
 		while (*head_cnt > 0 && head[*head_cnt - 1] >= i)
 			(*head_cnt)--;

@@ -301,9 +305,9 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
 	struct ethhdr *eh;

 	/* we guess the first iovec provided by the guest can embed
-	 * all the headers needed by L2 frame
+	 * all the headers needed by L2 frame, including any padding
 	 */
-	ASSERT(iov[0].iov_len >= hdrlen);
+	ASSERT(iov[0].iov_len >= MAX(hdrlen, ETH_ZLEN));

 	eh = vu_eth(base);

@@ -456,6 +460,9 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);

+	/* Pad first/single buffer only, it's at least ETH_ZLEN long */
+	vu_pad(iov, dlen + hdrlen);
+
 	if (*c->pcap) {
 		pcap_iov(iov, buf_cnt,
 			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
diff --git a/udp_vu.c b/udp_vu.c
index 099677f..c30dcf9 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -72,8 +72,8 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s,
 {
 	const struct vu_dev *vdev = c->vdev;
 	int iov_cnt, idx, iov_used;
+	size_t off, hdrlen, l2len;
 	struct msghdr msg = { 0 };
-	size_t off, hdrlen;

 	ASSERT(!c->no_udp);

@@ -90,7 +90,7 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s,
 		return 0;

 	/* reserve space for the headers */
-	ASSERT(iov_vu[0].iov_len >= hdrlen);
+	ASSERT(iov_vu[0].iov_len >= MAX(hdrlen, ETH_ZLEN));
 	iov_vu[0].iov_base = (char *)iov_vu[0].iov_base + hdrlen;
 	iov_vu[0].iov_len -= hdrlen;

@@ -116,6 +116,10 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s,
 	iov_vu[idx].iov_len = off;
 	iov_used = idx + !!off;

+	/* pad frame to 60 bytes: first buffer is at least ETH_ZLEN long */
+	l2len = *dlen + hdrlen - sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	vu_pad(&iov_vu[0], l2len);
+
 	vu_set_vnethdr(vdev, iov_vu[0].iov_base, iov_used);

 	/* release unused buffers */
diff --git a/vu_common.c b/vu_common.c
index ce61fa6..c682498 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -293,3 +293,17 @@ err:
 	return -1;
 }
+
+/**
+ * vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed
+ * @iov:	Buffer in iovec array where end of 802.3 frame is stored
+ * @l2len:	Layer-2 length already filled in frame
+ */
+void vu_pad(struct iovec *iov, size_t l2len)
+{
+	if (l2len >= ETH_ZLEN)
+		return;
+
+	memset((char *)iov->iov_base + iov->iov_len, 0, ETH_ZLEN - l2len);
+	iov->iov_len += ETH_ZLEN - l2len;
+}
diff --git a/vu_common.h b/vu_common.h
index c0883b2..27fe7e0 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -57,5 +57,6 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);
+void vu_pad(struct iovec *iov, size_t l2len);

 #endif /* VU_COMMON_H */
-- 
2.43.0
-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.  http://www.ozlabs.org/~dgibson
On Sat, Dec 06, 2025 at 02:26:56AM +0100, Stefano Brivio wrote:
> Add a further iovec frame part, TCP_IOV_ETH_PAD for TCP and
> UDP_IOV_ETH_PAD for UDP, after the payload; make it point to a
> zero-filled buffer, and send out part of it as needed to reach the
> minimum frame length given by 802.3, that is, 60 bytes altogether.
>
> The frames we might need to pad are IPv4 only (the IPv6 header is
> larger), and are typically TCP ACK segments but can also be small
> data segments or datagrams.
>
> Link: https://bugs.passt.top/show_bug.cgi?id=166
> Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
---
 tcp.c          |  2 +-
 tcp_buf.c      | 23 +++++++++++++++++++++++
 tcp_internal.h |  2 ++
 udp.c          | 33 ++++++++++++++++++++++++++++++---
 util.c         |  3 +++
 util.h         |  3 +++
 6 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/tcp.c b/tcp.c
index d6a5337..a582e4b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1037,7 +1037,7 @@ void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
 	else
 		tcp_update_csum(psum, th, payload);

-	tap_hdr_update(taph, l3len + sizeof(struct ethhdr));
+	tap_hdr_update(taph, MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN));
 }

 /**
diff --git a/tcp_buf.c b/tcp_buf.c
index 2058225..5d419d3 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -96,6 +96,7 @@ void tcp_sock_iov_init(const struct ctx *c)
 		iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp_payload_tap_hdr[i]);
 		iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
 		iov[TCP_IOV_PAYLOAD].iov_base = &tcp_payload[i];
+		iov[TCP_IOV_ETH_PAD].iov_base = eth_pad;
 	}
 }

@@ -144,6 +145,22 @@ void tcp_payload_flush(const struct ctx *c)
 	tcp_payload_used = 0;
 }

+/**
+ * tcp_l2_buf_pad() - Calculate padding to send out of padding (zero) buffer
+ * @iov:	Pointer to iovec of frame parts we're about to send
+ */
+static void tcp_l2_buf_pad(struct iovec *iov)
+{
+	size_t l2len = iov[TCP_IOV_ETH].iov_len +
+		       iov[TCP_IOV_IP].iov_len +
+		       iov[TCP_IOV_PAYLOAD].iov_len;
+
+	if (l2len < ETH_ZLEN)
+		iov[TCP_IOV_ETH_PAD].iov_len = ETH_ZLEN - l2len;
+	else
+		iov[TCP_IOV_ETH_PAD].iov_len = 0;
+}
+
 /**
  * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
  * @c:		Execution context
@@ -212,6 +229,8 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 	tcp_l2_buf_fill_headers(c, conn, iov, NULL, seq, false);

+	tcp_l2_buf_pad(iov);
+
 	if (flags & DUP_ACK) {
 		struct iovec *dup_iov = tcp_l2_iov[tcp_payload_used];
 		tcp_frame_conns[tcp_payload_used++] = conn;
@@ -223,6 +242,7 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		memcpy(dup_iov[TCP_IOV_PAYLOAD].iov_base,
 		       iov[TCP_IOV_PAYLOAD].iov_base, l4len);
 		dup_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
+		dup_iov[TCP_IOV_ETH_PAD].iov_len = iov[TCP_IOV_ETH_PAD].iov_len;
 	}

 	if (tcp_payload_used > TCP_FRAMES_MEM - 2)
@@ -270,6 +290,9 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 	payload->th.psh = push;
 	iov[TCP_IOV_PAYLOAD].iov_len = dlen + sizeof(struct tcphdr);
 	tcp_l2_buf_fill_headers(c, conn, iov, check, seq, false);
+
+	tcp_l2_buf_pad(iov);
+
 	if (++tcp_payload_used > TCP_FRAMES_MEM - 1)
 		tcp_payload_flush(c);
 }
diff --git a/tcp_internal.h b/tcp_internal.h
index 19e8922..5f8fb35 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -63,6 +63,7 @@
  * @TCP_IOV_ETH		Ethernet header
  * @TCP_IOV_IP		IP (v4/v6) header
  * @TCP_IOV_PAYLOAD	IP payload (TCP header + data)
+ * @TCP_IOV_ETH_PAD	Ethernet (802.3) padding to 60 bytes
  * @TCP_NUM_IOVS	the number of entries in the iovec array
  */
 enum tcp_iov_parts {
@@ -70,6 +71,7 @@ enum tcp_iov_parts {
 	TCP_IOV_ETH	= 1,
 	TCP_IOV_IP	= 2,
 	TCP_IOV_PAYLOAD	= 3,
+	TCP_IOV_ETH_PAD	= 4,
 	TCP_NUM_IOVS
 };

diff --git a/udp.c b/udp.c
index b93c18b..08bec50 100644
--- a/udp.c
+++ b/udp.c
@@ -168,6 +168,7 @@ udp_meta[UDP_MAX_FRAMES];
 * @UDP_IOV_ETH		Ethernet header
 * @UDP_IOV_IP		IP (v4/v6) header
 * @UDP_IOV_PAYLOAD	IP payload (UDP header + data)
+ * @UDP_IOV_ETH_PAD	Ethernet (802.3) padding to 60 bytes
 * @UDP_NUM_IOVS	the number of entries in the iovec array
 */
 enum udp_iov_idx {
@@ -175,6 +176,7 @@ enum udp_iov_idx {
 	UDP_IOV_ETH,
 	UDP_IOV_IP,
 	UDP_IOV_PAYLOAD,
+	UDP_IOV_ETH_PAD,
 	UDP_NUM_IOVS,
 };

@@ -239,6 +241,7 @@ static void udp_iov_init_one(const struct ctx *c, size_t i)
 	tiov[UDP_IOV_ETH] = IOV_OF_LVALUE(udp_eth_hdr[i]);
 	tiov[UDP_IOV_TAP] = tap_hdr_iov(c, &meta->taph);
 	tiov[UDP_IOV_PAYLOAD].iov_base = payload;
+	tiov[UDP_IOV_ETH_PAD].iov_base = eth_pad;

 	mh->msg_iov = siov;
 	mh->msg_iovlen = 1;
@@ -344,6 +347,22 @@ size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
 	return l4len;
 }

+/**
+ * udp_tap_pad() - Calculate padding to send out of padding (zero) buffer
+ * @iov:	Pointer to iovec of frame parts we're about to send
+ */
+static void udp_tap_pad(struct iovec *iov)
+{
+	size_t l2len = iov[UDP_IOV_ETH].iov_len +
+		       iov[UDP_IOV_IP].iov_len +
+		       iov[UDP_IOV_PAYLOAD].iov_len;
+
+	if (l2len < ETH_ZLEN)
+		iov[UDP_IOV_ETH_PAD].iov_len = ETH_ZLEN - l2len;
+	else
+		iov[UDP_IOV_ETH_PAD].iov_len = 0;
+}
+
 /**
  * udp_tap_prepare() - Convert one datagram into a tap frame
  * @mmh:	Receiving mmsghdr array
@@ -362,23 +381,31 @@ static void udp_tap_prepare(const struct mmsghdr *mmh,
 	struct ethhdr *eh = (*tap_iov)[UDP_IOV_ETH].iov_base;
 	struct udp_payload_t *bp = &udp_payload[idx];
 	struct udp_meta_t *bm = &udp_meta[idx];
-	size_t l4len;
+	size_t l4len, l2len;

 	eth_update_mac(eh, NULL, tap_omac);
 	if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->oaddr)) {
 		l4len = udp_update_hdr6(&bm->ip6h, bp, toside,
 					mmh[idx].msg_len, no_udp_csum);
-		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) + ETH_HLEN);
+
+		l2len = MAX(l4len + sizeof(bm->ip6h) + ETH_HLEN, ETH_ZLEN);
+		tap_hdr_update(&bm->taph, l2len);
+
 		eh->h_proto = htons_constant(ETH_P_IPV6);
 		(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h);
 	} else {
 		l4len = udp_update_hdr4(&bm->ip4h, bp, toside,
 					mmh[idx].msg_len, no_udp_csum);
-		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) + ETH_HLEN);
+
+		l2len = MAX(l4len + sizeof(bm->ip4h) + ETH_HLEN, ETH_ZLEN);
+		tap_hdr_update(&bm->taph, l2len);
+
 		eh->h_proto = htons_constant(ETH_P_IP);
 		(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip4h);
 	}
 	(*tap_iov)[UDP_IOV_PAYLOAD].iov_len = l4len;
+
+	udp_tap_pad(*tap_iov);
 }

 /**
diff --git a/util.c b/util.c
index 590373d..d2039df 100644
--- a/util.c
+++ b/util.c
@@ -39,6 +39,9 @@
 #include
 #endif

+/* Zero-filled buffer to pad 802.3 frames, up to 60 (ETH_ZLEN) bytes */
+uint8_t eth_pad[ETH_ZLEN] = { 0 };
+
 /**
  * sock_l4_() - Create and bind socket to socket address
  * @c:		Execution context
diff --git a/util.h b/util.h
index 40de694..326012c 100644
--- a/util.h
+++ b/util.h
@@ -17,6 +17,7 @@
 #include
 #include
+#include
 #include "log.h"

@@ -152,6 +153,8 @@ void abort_with_msg(const char *fmt, ...)
 #define ntohll(x)	(be64toh((x)))
 #define htonll(x)	(htobe64((x)))

+extern uint8_t eth_pad[ETH_ZLEN];
+
 /**
  * ntohl_unaligned() - Read 32-bit BE value from a possibly unaligned address
  * @p:		Pointer to the BE value in memory
-- 
2.43.0
-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.  http://www.ozlabs.org/~dgibson