[PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element
This is the TCP counterpart to the UDP multi-iov series. It converts the
TCP vhost-user receive path from direct pointer arithmetic (via vu_eth(),
vu_ip(), etc.) to the iov_tail abstraction, removing the assumption that
all headers reside in a single contiguous buffer.

With this series applied, the TCP path correctly handles virtio-net
drivers that provide multiple buffers per virtqueue element (e.g. iPXE
provides the vnet header in the first buffer and the frame payload in a
second one), matching the support already present in the UDP path.

Based-on: 20260323143151.538673-1-lvivier@redhat.com

Laurent Vivier (7):
  tcp: pass ipv4h checksum, not a pointer to the checksum
  tcp: use iov_tail to access headers in tcp_fill_headers()
  tcp_vu: Use iov_tail helpers to build headers in tcp_vu_prepare()
  tcp_vu: Support multibuffer frames in tcp_vu_sock_recv()
  tcp: Use iov_tail to access headers in tcp_prepare_flags()
  iov: introduce iov_memcopy()
  tcp_vu: Use iov_tail helpers to build headers in tcp_vu_send_flag()

 iov.c          |  45 +++++++
 iov.h          |   2 +
 tcp.c          | 180 ++++++++++++++-----------
 tcp_buf.c      |  34 ++---
 tcp_internal.h |   9 +-
 tcp_vu.c       | 354 ++++++++++++++++++++++++++++---------------------
 vu_common.h    |  20 ---
 7 files changed, 368 insertions(+), 276 deletions(-)

-- 
2.53.0
tcp_fill_headers() takes a pointer to a previously computed IPv4 header
checksum to avoid recalculating it when the payload length doesn't
change. A subsequent patch makes tcp_fill_headers() access ip4h via
with_header() which scopes it to a temporary variable, so a pointer to
ip4h->check would become dangling after the with_header() block.
Pass the checksum by value as an int instead, using -1 as the sentinel
to indicate that the checksum should be computed from scratch (replacing
the NULL pointer sentinel). As the checksum is a uint16_t, -1 cannot be
a valid checksum value in an int.
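The pattern described above can be sketched in a few lines (names here are illustrative, not passt's actual API): a uint16_t checksum widened to int leaves -1 as an impossible value, so "compute from scratch" no longer needs a pointer sentinel that could dangle once the header is only visible inside a scoped block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return the checksum to store in the IPv4 header.
 * -1 means "no cached value, recompute"; any other value is a cached
 * uint16_t checksum passed by value, so nothing can dangle.
 */
static uint16_t pick_check(int ip4_check)
{
	if (ip4_check != -1)
		return (uint16_t)ip4_check; /* reuse the cached checksum */
	return 0x1234; /* stand-in for a real csum_ip4_header() call */
}
```

Because every uint16_t value (0..65535) is non-negative when widened to int, the -1 sentinel can never collide with a real checksum.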
Signed-off-by: Laurent Vivier
Instead of receiving individual pointers to each protocol header (eh,
ip4h, ip6h, th), have tcp_fill_headers() take an iov_tail starting at
the Ethernet header and walk through it using with_header() and
IOV_DROP_HEADER() to access each header in turn.
Replace the ip4h/ip6h NULL-pointer convention with a bool ipv4
parameter, and move Ethernet header filling (MAC address and ethertype)
into tcp_fill_headers() as well, since the function now owns the full
header chain.
This simplifies callers, which no longer need to extract and pass
individual header pointers.
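The core idea of walking headers through an iov_tail, rather than assuming they sit contiguously in the first buffer, can be sketched as follows. This is a minimal illustration with a hypothetical cursor type, not passt's actual iov_tail/with_header() implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical cursor over an iovec array: (iov, cnt) plus a byte offset
 * of data already consumed, so headers can be peeled one at a time even
 * when they straddle iovec boundaries.
 */
struct tail {
	const struct iovec *iov;
	size_t cnt;
	size_t off;	/* bytes already consumed */
};

/* Copy out the next len bytes without assuming they are contiguous */
static int tail_peek(const struct tail *t, void *dst, size_t len)
{
	size_t skip = t->off, done = 0;

	for (size_t i = 0; i < t->cnt && done < len; i++) {
		const char *base = t->iov[i].iov_base;
		size_t n = t->iov[i].iov_len;

		if (skip >= n) {		/* entirely before the cursor */
			skip -= n;
			continue;
		}
		base += skip;
		n -= skip;
		skip = 0;
		if (n > len - done)
			n = len - done;
		memcpy((char *)dst + done, base, n);
		done += n;
	}
	return done == len;
}
```

Advancing `off` by sizeof(struct ethhdr), then sizeof(struct iphdr), and so on gives the "drop header, look at the next one" walk the commit message describes.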
Signed-off-by: Laurent Vivier
Replace direct pointer arithmetic using vu_eth(), vu_ip(),
vu_payloadv4() and vu_payloadv6() with the iov_tail abstraction
to build Ethernet, IP and TCP headers in tcp_vu_prepare().
This removes the assumption that all headers reside in a single
contiguous iov entry.
Signed-off-by: Laurent Vivier
Previously, tcp_vu_sock_recv() assumed a 1:1 mapping between virtqueue
elements and iovecs (one iovec per element), enforced by an ASSERT.
This prevented the use of virtqueue elements with multiple buffers
(e.g. when mergeable rx buffers are not negotiated and headers are
provided in a separate buffer).
Introduce a struct vu_frame to track per-frame metadata: the range of
elements and iovecs that make up each frame, and the frame's total size.
This replaces the head[] array which only tracked element indices.
A separate iov_msg[] array is built for recvmsg() by cloning the data
portions (after stripping headers) using iov_tail helpers.
Frame truncation after recvmsg() then walks the frame and element arrays
to adjust the iovec and element counts accordingly.
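A plausible shape for the per-frame metadata described above is sketched below. The field names are hypothetical, chosen only to mirror the commit message; the actual definition lives in tcp_vu.c:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-frame bookkeeping: which virtqueue elements and which
 * iovec entries a frame spans, plus its total size, replacing a bare
 * array of element indices.
 */
struct vu_frame {
	unsigned first_elem;	/* index of first virtqueue element */
	unsigned elem_cnt;	/* number of elements in this frame */
	unsigned first_iov;	/* index of first iovec entry */
	unsigned iov_cnt;	/* number of iovec entries spanned */
	size_t size;		/* total frame size in bytes */
};
```

Tracking ranges rather than single indices is what lets truncation shrink a frame by trimming iovecs and elements from the tail.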
Signed-off-by: Laurent Vivier
Instead of passing separate pointers to the TCP header and SYN options,
pass an iov_tail pointing to the start of the TCP payload area.
tcp_prepare_flags() then uses with_header() and IOV_DROP_HEADER() to
access the TCP header and SYN options directly within the iov, matching
the iov_tail-based approach already used by tcp_fill_headers().
Signed-off-by: Laurent Vivier
Copy buffer data from one iovec array to another.
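The essence of such a helper can be sketched as below. The name and signature are assumptions for illustration, not necessarily those of the iov_memcopy() this patch adds: copy up to the common size of two iovec arrays, advancing through both sides independently so entry boundaries need not line up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Copy as much data as fits from src[] into dst[], returning the number
 * of bytes copied. Neither side's entry boundaries need to match.
 */
static size_t iov_copy_all(struct iovec *dst, size_t dst_cnt,
			   const struct iovec *src, size_t src_cnt)
{
	size_t di = 0, si = 0, doff = 0, soff = 0, total = 0;

	while (di < dst_cnt && si < src_cnt) {
		size_t n = dst[di].iov_len - doff;	/* room left in dst entry */
		size_t m = src[si].iov_len - soff;	/* data left in src entry */
		size_t len = n < m ? n : m;

		memcpy((char *)dst[di].iov_base + doff,
		       (const char *)src[si].iov_base + soff, len);
		total += len;
		doff += len;
		soff += len;
		if (doff == dst[di].iov_len) {	/* dst entry full, advance */
			di++;
			doff = 0;
		}
		if (soff == src[si].iov_len) {	/* src entry drained, advance */
			si++;
			soff = 0;
		}
	}
	return total;
}
```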
Signed-off-by: Laurent Vivier
tcp_vu_send_flag() previously used vu_eth(), vu_ip(), vu_payloadv4() and
vu_payloadv6() to access headers via direct pointer arithmetic on a
single contiguous buffer. Replace these with iov_tail-based
with_header() and IOV_DROP_HEADER() accessors, which work correctly with
multi-buffer virtio descriptors.
Enlarge the flags_iov and flags_elem arrays to accommodate split
descriptors, and pass iov_cnt through vu_collect() so we track
how many iovecs the frame actually spans.
Extract a tcp_vu_send_dup() helper for the DUP_ACK path that uses
iov_memcopy() to duplicate a frame across potentially multi-buffer
descriptors, replacing the previous single-buffer memcpy.
With all callers converted, remove the vu_eth(), vu_ip(),
vu_payloadv4() and vu_payloadv6() inline helpers from vu_common.h,
and drop the cppcheck-suppress annotation on iov_memcopy() which is
now used.
Signed-off-by: Laurent Vivier
On Mon, Mar 23, 2026 at 05:52:53PM +0100, Laurent Vivier wrote:
tcp_fill_headers() takes a pointer to a previously computed IPv4 header checksum to avoid recalculating it when the payload length doesn't change. A subsequent patch makes tcp_fill_headers() access ip4h via with_header() which scopes it to a temporary variable, so a pointer to ip4h->check would become dangling after the with_header() block.
Oof, that kind of indicates the dangers with the with_header() structure. Is that change merged already? If not we should probably fix it before merge rather than after the fact.
Pass the checksum by value as an int instead, using -1 as the sentinel to indicate that the checksum should be computed from scratch (replacing the NULL pointer sentinel). As the checksum is a uint16_t, -1 cannot be a valid checksum value in an int.
That said, passing this by value I think is cleaner than the pointer,
regardless of other reasons. Would it also make sense to flag
no_tcp_csum with an additional special value (a #define to -2, or
whatever) instead of using an extra parameter? Logically the checksum
parameter would be:
CALCULATE | UNNEEDED | specific value
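The suggested encoding could look like this sketch (the macro names are hypothetical, not from the actual code): both control values are negative, so neither can collide with a valid uint16_t checksum, and the separate no_tcp_csum flag becomes unnecessary.

```c
#include <assert.h>

/* Hypothetical three-state checksum parameter, as suggested above */
#define CSUM_CALCULATE	(-1)	/* compute the checksum from scratch */
#define CSUM_UNNEEDED	(-2)	/* no checksum wanted at all */

/* Any value >= 0 is a specific cached checksum to reuse */
static int csum_is_control(int check)
{
	return check == CSUM_CALCULATE || check == CSUM_UNNEEDED;
}
```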
Reviewed-by: David Gibson
Signed-off-by: Laurent Vivier
---
 tcp.c          |  7 +++----
 tcp_buf.c      | 10 +++++-----
 tcp_internal.h |  3 +--
 tcp_vu.c       | 12 ++++++------
 4 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/tcp.c b/tcp.c
index b14586249c4e..158a5be0327e 100644
--- a/tcp.c
+++ b/tcp.c
@@ -953,8 +953,7 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
 			struct ethhdr *eh,
 			struct iphdr *ip4h, struct ipv6hdr *ip6h,
 			struct tcphdr *th, struct iov_tail *payload,
-			const uint16_t *ip4_check, uint32_t seq,
-			bool no_tcp_csum)
+			int ip4_check, uint32_t seq, bool no_tcp_csum)
 {
 	const struct flowside *tapside = TAPFLOW(conn);
 	size_t l4len = iov_tail_size(payload) + sizeof(*th);
@@ -974,8 +973,8 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
 		ip4h->saddr = src4->s_addr;
 		ip4h->daddr = dst4->s_addr;

-		if (ip4_check)
-			ip4h->check = *ip4_check;
+		if (ip4_check != -1)
+			ip4h->check = ip4_check;
 		else
 			ip4h->check = csum_ip4_header(l3len, IPPROTO_TCP,
 						      *src4, *dst4);
diff --git a/tcp_buf.c b/tcp_buf.c
index 41965b107567..bc0f58dd7a5e 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -172,7 +172,7 @@ static void tcp_l2_buf_pad(struct iovec *iov)
  */
 static void tcp_l2_buf_fill_headers(const struct ctx *c,
 				    struct tcp_tap_conn *conn,
-				    struct iovec *iov, const uint16_t *check,
+				    struct iovec *iov, int check,
 				    uint32_t seq, bool no_tcp_csum)
 {
 	struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_PAYLOAD], 1, 0);
@@ -233,7 +233,7 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	if (flags & KEEPALIVE)
 		seq--;

-	tcp_l2_buf_fill_headers(c, conn, iov, NULL, seq, false);
+	tcp_l2_buf_fill_headers(c, conn, iov, -1, seq, false);

 	tcp_l2_buf_pad(iov);

@@ -270,7 +270,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 			    ssize_t dlen, int no_csum, uint32_t seq, bool push)
 {
 	struct tcp_payload_t *payload;
-	const uint16_t *check = NULL;
+	int check = -1;
 	struct iovec *iov;

 	conn->seq_to_tap = seq + dlen;
@@ -279,9 +279,9 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 	if (CONN_V4(conn)) {
 		if (no_csum) {
 			struct iovec *iov_prev = tcp_l2_iov[tcp_payload_used - 1];
-			struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
+			const struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;

-			check = &iph->check;
+			check = iph->check;
 		}
 		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
 	} else if (CONN_V6(conn)) {
diff --git a/tcp_internal.h b/tcp_internal.h
index d9408852571f..bb7a6629839c 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -187,8 +187,7 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
 			struct ethhdr *eh,
 			struct iphdr *ip4h, struct ipv6hdr *ip6h,
 			struct tcphdr *th, struct iov_tail *payload,
-			const uint16_t *ip4_check, uint32_t seq,
-			bool no_tcp_csum);
+			int ip4_check, uint32_t seq, bool no_tcp_csum);

 int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 			  bool force_seq, struct tcp_info_linux *tinfo);
diff --git a/tcp_vu.c b/tcp_vu.c
index 3001defb5467..a21ee3499aed 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -138,7 +138,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		seq--;

 	tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
-			 NULL, seq, !*c->pcap);
+			 -1, seq, !*c->pcap);

 	if (*c->pcap)
 		pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
@@ -283,7 +283,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
  */
 static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
 			   struct iovec *iov, size_t iov_cnt,
-			   const uint16_t **check, bool no_tcp_csum, bool push)
+			   int *check, bool no_tcp_csum, bool push)
 {
 	const struct flowside *toside = TAPFLOW(conn);
 	bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
@@ -329,7 +329,7 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
 	tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
 			 *check, conn->seq_to_tap, no_tcp_csum);
 	if (ip4h)
-		*check = &ip4h->check;
+		*check = ip4h->check;
 }

@@ -350,7 +350,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	size_t hdrlen, fillsize;
 	int v6 = CONN_V6(conn);
 	uint32_t already_sent;
-	const uint16_t *check;
+	int check;

 	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
 		debug("Got packet, but RX virtqueue not usable yet");
@@ -437,7 +437,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	 */
 	hdrlen = tcp_vu_hdrlen(v6);
-	for (i = 0, previous_dlen = -1, check = NULL; i < head_cnt; i++) {
+	for (i = 0, previous_dlen = -1, check = -1; i < head_cnt; i++) {
 		struct iovec *iov = &elem[head[i]].in_sg[0];
 		int buf_cnt = head[i + 1] - head[i];
 		size_t frame_size = iov_size(iov, buf_cnt);
@@ -451,7 +451,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		/* The IPv4 header checksum varies only with dlen */
 		if (previous_dlen != dlen)
-			check = NULL;
+			check = -1;
 		previous_dlen = dlen;

 		tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
-- 
2.53.0
-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other
				| way around.
http://www.ozlabs.org/~dgibson
On Mon, Mar 23, 2026 at 05:52:54PM +0100, Laurent Vivier wrote:
Instead of receiving individual pointers to each protocol header (eh, ip4h, ip6h, th), have tcp_fill_headers() take an iov_tail starting at the Ethernet header and walk through it using with_header() and IOV_DROP_HEADER() to access each header in turn.
Replace the ip4h/ip6h NULL-pointer convention with a bool ipv4 parameter, and move Ethernet header filling (MAC address and ethertype) into tcp_fill_headers() as well, since the function now owns the full header chain.
This simplifies callers, which no longer need to extract and pass individual header pointers.
Signed-off-by: Laurent Vivier
---
 tcp.c          | 106 +++++++++++++++++++++++++++----------------------
 tcp_buf.c      |  16 ++------
 tcp_internal.h |   4 +-
 tcp_vu.c       |  12 +++---
 4 files changed, 70 insertions(+), 68 deletions(-)

diff --git a/tcp.c b/tcp.c
index 158a5be0327e..058792d5b184 100644
--- a/tcp.c
+++ b/tcp.c
@@ -938,11 +938,8 @@ static void tcp_fill_header(struct tcphdr *th,
  * tcp_fill_headers() - Fill 802.3, IP, TCP headers
  * @c:		Execution context
  * @conn:	Connection pointer
- * @eh:		Pointer to Ethernet header
- * @ip4h:	Pointer to IPv4 header, or NULL
- * @ip6h:	Pointer to IPv6 header, or NULL
- * @th:		Pointer to TCP header
- * @payload:	TCP payload
+ * @ipv4:	True for IPv4, false for IPv6
+ * @payload:	IOV tail starting at the Ethernet header

As with UDP, I'm not sure I like removing the separate parameters. A
little more so, since this doesn't (any more) have the ugly
udp_payload_t equivalent. At minimum @payload is no longer a good name.

 * @ip4_check:	IPv4 checksum, if already known
 * @seq:	Sequence number for this segment
 * @no_tcp_csum: Do not set TCP checksum
@@ -950,74 +947,89 @@ static void tcp_fill_header(struct tcphdr *th,
  * Return: frame length (including L2 headers)
  */
 size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
-			struct ethhdr *eh,
-			struct iphdr *ip4h, struct ipv6hdr *ip6h,
-			struct tcphdr *th, struct iov_tail *payload,
+			bool ipv4, const struct iov_tail *payload,
 			int ip4_check, uint32_t seq, bool no_tcp_csum)
 {
 	const struct flowside *tapside = TAPFLOW(conn);
-	size_t l4len = iov_tail_size(payload) + sizeof(*th);
+	struct iov_tail current = *payload;
 	uint8_t *omac = conn->f.tap_omac;
-	size_t l3len = l4len;
+	size_t l3len, l4len;
 	uint32_t psum = 0;

-	if (ip4h) {
+	with_header(struct ethhdr, eh, &current) {
+		if (ipv4)
+			eh->h_proto = htons_constant(ETH_P_IP);
+		else
+			eh->h_proto = htons_constant(ETH_P_IPV6);
+
+		/* Find if neighbour table has a recorded MAC address */
+		if (MAC_IS_UNDEF(omac))
+			fwd_neigh_mac_get(c, &tapside->oaddr, omac);
+		eth_update_mac(eh, NULL, omac);
+	}
+	IOV_DROP_HEADER(&current, struct ethhdr);
+
+	l3len = iov_tail_size(&current);
+
+	if (ipv4) {
 		const struct in_addr *src4 = inany_v4(&tapside->oaddr);
 		const struct in_addr *dst4 = inany_v4(&tapside->eaddr);

 		assert(src4 && dst4);

-		l3len += sizeof(*ip4h);
+		l4len = l3len - sizeof(struct iphdr);

-		ip4h->tot_len = htons(l3len);
-		ip4h->saddr = src4->s_addr;
-		ip4h->daddr = dst4->s_addr;
+		with_header(struct iphdr, ip4h, &current) {
+			ip4h->tot_len = htons(l3len);
+			ip4h->saddr = src4->s_addr;
+			ip4h->daddr = dst4->s_addr;

-		if (ip4_check != -1)
-			ip4h->check = ip4_check;
-		else
-			ip4h->check = csum_ip4_header(l3len, IPPROTO_TCP,
-						      *src4, *dst4);
+			if (ip4_check != -1)
+				ip4h->check = ip4_check;
+			else
+				ip4h->check = csum_ip4_header(l3len,
							      IPPROTO_TCP,
							      *src4, *dst4);
+		}
+		IOV_DROP_HEADER(&current, struct iphdr);

 		if (!no_tcp_csum) {
 			psum = proto_ipv4_header_psum(l4len, IPPROTO_TCP,
 						      *src4, *dst4);
 		}
-		eh->h_proto = htons_constant(ETH_P_IP);
-	}
-
-	if (ip6h) {
-		l3len += sizeof(*ip6h);
+	} else {
+		l4len = l3len - sizeof(struct ipv6hdr);

-		ip6h->payload_len = htons(l4len);
-		ip6h->saddr = tapside->oaddr.a6;
-		ip6h->daddr = tapside->eaddr.a6;
+		with_header(struct ipv6hdr, ip6h, &current) {
+			ip6h->payload_len = htons(l4len);
+			ip6h->saddr = tapside->oaddr.a6;
+			ip6h->daddr = tapside->eaddr.a6;

-		ip6h->hop_limit = 255;
-		ip6h->version = 6;
-		ip6h->nexthdr = IPPROTO_TCP;
+			ip6h->hop_limit = 255;
+			ip6h->version = 6;
+			ip6h->nexthdr = IPPROTO_TCP;

-		ip6_set_flow_lbl(ip6h, conn->sock);
+			ip6_set_flow_lbl(ip6h, conn->sock);

-		if (!no_tcp_csum) {
-			psum = proto_ipv6_header_psum(l4len, IPPROTO_TCP,
-						      &ip6h->saddr,
-						      &ip6h->daddr);
+			if (!no_tcp_csum) {
+				psum = proto_ipv6_header_psum(l4len,
							      IPPROTO_TCP,
							      &ip6h->saddr,
							      &ip6h->daddr);
+			}
 		}
-		eh->h_proto = htons_constant(ETH_P_IPV6);
+		IOV_DROP_HEADER(&current, struct ipv6hdr);
 	}

-	/* Find if neighbour table has a recorded MAC address */
-	if (MAC_IS_UNDEF(omac))
-		fwd_neigh_mac_get(c, &tapside->oaddr, omac);
-	eth_update_mac(eh, NULL, omac);
-
-	tcp_fill_header(th, conn, seq);
-
-	if (no_tcp_csum)
+	with_header(struct tcphdr, th, &current) {
+		tcp_fill_header(th, conn, seq);
 		th->check = 0;
-	else
-		tcp_update_csum(psum, th, payload);
+	}
+
+	if (!no_tcp_csum) {
+		with_header(struct tcphdr, th, &current)
+			th->check = csum_iov_tail(&current, psum);
+	}

 	return MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN);
 }
diff --git a/tcp_buf.c b/tcp_buf.c
index bc0f58dd7a5e..891043c96dcb 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -175,23 +175,15 @@ static void tcp_l2_buf_fill_headers(const struct ctx *c,
 				    struct iovec *iov, int check,
 				    uint32_t seq, bool no_tcp_csum)
 {
-	struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_PAYLOAD], 1, 0);
-	struct tcphdr th_storage, *th = IOV_REMOVE_HEADER(&tail, th_storage);
+	struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_ETH],
+					TCP_IOV_PAYLOAD + 1 - TCP_IOV_ETH, 0);
 	struct tap_hdr *taph = iov[TCP_IOV_TAP].iov_base;
 	const struct flowside *tapside = TAPFLOW(conn);
-	const struct in_addr *a4 = inany_v4(&tapside->oaddr);
-	struct ethhdr *eh = iov[TCP_IOV_ETH].iov_base;
-	struct ipv6hdr *ip6h = NULL;
-	struct iphdr *ip4h = NULL;
+	bool ipv4 = inany_v4(&tapside->oaddr) != NULL;
 	size_t l2len;

-	if (a4)
-		ip4h = iov[TCP_IOV_IP].iov_base;
-	else
-		ip6h = iov[TCP_IOV_IP].iov_base;
+	l2len = tcp_fill_headers(c, conn, ipv4, &tail, check, seq,
+				 no_tcp_csum);

-	l2len = tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &tail, check,
-				 seq, no_tcp_csum);
 	tap_hdr_update(taph, l2len);
 }

diff --git a/tcp_internal.h b/tcp_internal.h
index bb7a6629839c..136e947f6e70 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -184,9 +184,7 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
 struct tcp_info_linux;

 size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
-			struct ethhdr *eh,
-			struct iphdr *ip4h, struct ipv6hdr *ip6h,
-			struct tcphdr *th, struct iov_tail *payload,
+			bool ipv4, const struct iov_tail *payload,
 			int ip4_check, uint32_t seq, bool no_tcp_csum);

 int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
diff --git a/tcp_vu.c b/tcp_vu.c
index a21ee3499aed..c6206b7a689c 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -132,13 +132,12 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	}

 	vu_pad(&flags_iov[0], 1, 0, hdrlen + optlen);
-	payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen);

 	if (flags & KEEPALIVE)
 		seq--;

-	tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
-			 -1, seq, !*c->pcap);
+	payload = IOV_TAIL(flags_elem[0].in_sg, 1, VNET_HLEN);
+	tcp_fill_headers(c, conn, CONN_V4(conn), &payload, -1, seq, !*c->pcap);

 	if (*c->pcap)
 		pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
@@ -288,10 +287,10 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
 	const struct flowside *toside = TAPFLOW(conn);
 	bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
 	size_t hdrlen = tcp_vu_hdrlen(v6);
-	struct iov_tail payload = IOV_TAIL(iov, iov_cnt, hdrlen);
 	char *base = iov[0].iov_base;
 	struct ipv6hdr *ip6h = NULL;
 	struct iphdr *ip4h = NULL;
+	struct iov_tail payload;
 	struct tcphdr *th;
 	struct ethhdr *eh;

@@ -326,8 +325,9 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
 	th->ack = 1;
 	th->psh = push;

-	tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
-			 *check, conn->seq_to_tap, no_tcp_csum);
+	payload = IOV_TAIL(iov, iov_cnt, VNET_HLEN);
+	tcp_fill_headers(c, conn, !v6, &payload, *check, conn->seq_to_tap,
+			 no_tcp_csum);
 	if (ip4h)
 		*check = ip4h->check;
 }
-- 
2.53.0
On 3/24/26 04:53, David Gibson wrote:
On Mon, Mar 23, 2026 at 05:52:53PM +0100, Laurent Vivier wrote:
tcp_fill_headers() takes a pointer to a previously computed IPv4 header checksum to avoid recalculating it when the payload length doesn't change. A subsequent patch makes tcp_fill_headers() access ip4h via with_header() which scopes it to a temporary variable, so a pointer to ip4h->check would become dangling after the with_header() block.
Oof, that kind of indicates the dangers with the with_header() structure. Is that change merged already? If not we should probably fix it before merge rather than after the fact.
The problem appears in the subsequent patches of the series. So at this point we prevent the problem, don't fix it.
Pass the checksum by value as an int instead, using -1 as the sentinel to indicate that the checksum should be computed from scratch (replacing the NULL pointer sentinel). As the checksum is a uint16_t, -1 cannot be a valid checksum value in an int.
That said, passing this by value I think is cleaner than the pointer, regardless of other reasons. Would it also make sense to flag no_tcp_csum with an additional special value (a #define to -2, or whatever) instead of using an extra parameter? Logically the checksum parameter would be: CALCULATE | UNNEEDED | specific value
I can extract the patch from this series to make this improvement on the current code.

Thanks,
Laurent
Reviewed-by: David Gibson
Signed-off-by: Laurent Vivier
On Tue, Mar 24, 2026 at 08:56:48AM +0100, Laurent Vivier wrote:
On 3/24/26 04:53, David Gibson wrote:
On Mon, Mar 23, 2026 at 05:52:53PM +0100, Laurent Vivier wrote:
tcp_fill_headers() takes a pointer to a previously computed IPv4 header checksum to avoid recalculating it when the payload length doesn't change. A subsequent patch makes tcp_fill_headers() access ip4h via with_header() which scopes it to a temporary variable, so a pointer to ip4h->check would become dangling after the with_header() block.
Oof, that kind of indicates the dangers with the with_header() structure. Is that change merged already? If not we should probably fix it before merge rather than after the fact.
The problem appears in the subsequent patches of the series. So at this point we prevent the problem, don't fix it.
Oh, sorry, I misread your comment which already said that.
Pass the checksum by value as an int instead, using -1 as the sentinel to indicate that the checksum should be computed from scratch (replacing the NULL pointer sentinel). As the checksum is a uint16_t, -1 cannot be a valid checksum value in an int.
That said, passing this by value I think is cleaner than the pointer, regardless of other reasons. Would it also make sense to flag no_tcp_csum with an additional special value (a #define to -2, or whatever) instead of using an extra parameter? Logically the checksum parameter would be: CALCULATE | UNNEEDED | specific value
I can extract the patch from this series to make this improvement on the current code.
Thanks, Laurent
Reviewed-by: David Gibson
Signed-off-by: Laurent Vivier
On Mon, Mar 23, 2026 at 05:52:55PM +0100, Laurent Vivier wrote:
Replace direct pointer arithmetic using vu_eth(), vu_ip(), vu_payloadv4() and vu_payloadv6() with the iov_tail abstraction to build Ethernet, IP and TCP headers in tcp_vu_prepare().
This removes the assumption that all headers reside in a single contiguous iov entry.
Signed-off-by: Laurent Vivier
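The core of the iov_tail idea above can be sketched in isolation: a cursor over an iovec array plus a consumed-byte offset, from which header bytes can be copied out even when they straddle a buffer boundary. The names below (iov_tail_sketch, tail_peek) are illustrative only, not passt's actual API, which works through macros such as IOV_PEEK_HEADER, IOV_DROP_HEADER and with_header().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Simplified sketch of a "tail" over an iovec array: the entries,
 * their count, and how many bytes have already been consumed.
 */
struct iov_tail_sketch {
	const struct iovec *iov;
	size_t cnt;
	size_t off;	/* bytes already consumed from the array */
};

/* Copy 'len' bytes from the tail into 'dst', crossing iovec
 * boundaries as needed.  Return 0 on success, -1 if the tail holds
 * fewer than 'len' bytes.  This is the property the series relies on:
 * a header no longer needs to sit in one contiguous buffer.
 */
static int tail_peek(const struct iov_tail_sketch *t, void *dst, size_t len)
{
	size_t skip = t->off, copied = 0;

	for (size_t i = 0; i < t->cnt && copied < len; i++) {
		const char *base = t->iov[i].iov_base;
		size_t avail = t->iov[i].iov_len;

		if (skip >= avail) {	/* entry fully consumed already */
			skip -= avail;
			continue;
		}
		base += skip;
		avail -= skip;
		skip = 0;

		size_t n = len - copied < avail ? len - copied : avail;
		memcpy((char *)dst + copied, base, n);
		copied += n;
	}
	return copied == len ? 0 : -1;
}
```

With a 4-byte "header" split as 2 bytes at the end of one buffer and 2 at the start of the next, tail_peek() still reassembles it, which is exactly the case (e.g. iPXE's split vnet header) the cover letter describes.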
LGTM, with the exception of my general concerns about with_header().
---
 tcp_vu.c | 61 ++++++++++++++++++++++++++------------------------------
 1 file changed, 28 insertions(+), 33 deletions(-)

diff --git a/tcp_vu.c b/tcp_vu.c
index c6206b7a689c..a39f6ea018e9 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -222,6 +222,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
 	/* reserve space for headers in iov */
 	iov = &elem[elem_cnt].in_sg[0];
 	assert(iov->iov_len >= hdrlen);
+
Unrelated change (usually I don't mind if it's a reasonable change adjacent to what you're doing, but this is an entirely different function).
 	iov->iov_base = (char *)iov->iov_base + hdrlen;
 	iov->iov_len -= hdrlen;
 	head[(*head_cnt)++] = elem_cnt;
@@ -285,51 +286,45 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
 			   int *check, bool no_tcp_csum, bool push)
 {
 	const struct flowside *toside = TAPFLOW(conn);
-	bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
-	size_t hdrlen = tcp_vu_hdrlen(v6);
-	char *base = iov[0].iov_base;
-	struct ipv6hdr *ip6h = NULL;
-	struct iphdr *ip4h = NULL;
+	bool ipv4 = inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr);
 	struct iov_tail payload;
-	struct tcphdr *th;
-	struct ethhdr *eh;
-
-	/* we guess the first iovec provided by the guest can embed
-	 * all the headers needed by L2 frame, including any padding
-	 */
-	assert(iov[0].iov_len >= hdrlen);
-
-	eh = vu_eth(base);
-	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
+	payload = IOV_TAIL(iov, iov_cnt, VNET_HLEN);
+	with_header(struct ethhdr, eh, &payload)
+		memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
+	IOV_DROP_HEADER(&payload, struct ethhdr);
 
 	/* initialize header */
-	if (!v6) {
-		eh->h_proto = htons(ETH_P_IP);
-
-		ip4h = vu_ip(base);
-		*ip4h = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
-		th = vu_payloadv4(base);
+	if (ipv4) {
+		with_header(struct iphdr, ip4h, &payload)
+			*ip4h = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
+		IOV_DROP_HEADER(&payload, struct iphdr);
 	} else {
-		eh->h_proto = htons(ETH_P_IPV6);
-
-		ip6h = vu_ip(base);
-		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
-
-		th = vu_payloadv6(base);
+		with_header(struct ipv6hdr, ip6h, &payload)
+			*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
+		IOV_DROP_HEADER(&payload, struct ipv6hdr);
 	}
 
-	memset(th, 0, sizeof(*th));
-	th->doff = sizeof(*th) / 4;
-	th->ack = 1;
-	th->psh = push;
+	with_header(struct tcphdr, th, &payload) {
+		memset(th, 0, sizeof(*th));
+		th->doff = sizeof(*th) / 4;
+		th->ack = 1;
+		th->psh = push;
+	}
 
 	payload = IOV_TAIL(iov, iov_cnt, VNET_HLEN);
-	tcp_fill_headers(c, conn, !v6, &payload, *check, conn->seq_to_tap,
+	tcp_fill_headers(c, conn, ipv4, &payload, *check, conn->seq_to_tap,
 			 no_tcp_csum);
-	if (ip4h)
+	if (ipv4) {
+		struct iphdr ip4h_storage;
+		const struct iphdr *ip4h;
+
+		IOV_DROP_HEADER(&payload, struct ethhdr);
+		ip4h = IOV_PEEK_HEADER(&payload, ip4h_storage);
 		*check = ip4h->check;
+	}
 }
 
 /**
-- 
2.53.0
On Mon, Mar 23, 2026 at 05:52:52PM +0100, Laurent Vivier wrote:
This is the TCP counterpart to the UDP multi-iov series. It converts the TCP vhost-user receive path from direct pointer arithmetic (via vu_eth(), vu_ip(), etc.) to the iov_tail abstraction, removing the assumption that all headers reside in a single contiguous buffer.
With this series applied, the TCP path correctly handles virtio-net drivers that provide multiple buffers per virtqueue element (e.g. iPXE provides the vnet header in the first buffer and the frame payload in a second one), matching the support already present in the UDP path.
Based-on: 20260323143151.538673-1-lvivier@redhat.com
I didn't finish reviewing this series. However, I'm going to stop here and wait for the spin based on only truncating the iovs when they're flushed.
On Mon, Mar 23, 2026 at 05:52:56PM +0100, Laurent Vivier wrote:
Previously, tcp_vu_sock_recv() assumed a 1:1 mapping between virtqueue elements and iovecs (one iovec per element), enforced by an ASSERT. This prevented the use of virtqueue elements with multiple buffers (e.g. when mergeable rx buffers are not negotiated and headers are provided in a separate buffer).
Introduce a struct vu_frame to track per-frame metadata: the range of elements and iovecs that make up each frame, and the frame's total size. This replaces the head[] array which only tracked element indices.
A separate iov_msg[] array is built for recvmsg() by cloning the data portions (after stripping headers) using iov_tail helpers.
After recvmsg(), frame truncation walks the frame and element arrays to adjust the iovec and element counts accordingly.
Signed-off-by: Laurent Vivier
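The iov_msg[] construction described above can be sketched as a standalone helper: given one frame's source iovecs and the number of header bytes at its front, emit a second iovec array that covers only the data portion. The helper name (clone_skip) and its exact shape are illustrative; the patch itself does this with iov_drop_header() and iov_tail_clone().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Build a header-less iovec view suitable for recvmsg(): copy the
 * source entries while dropping the first 'hdrlen' bytes.  Entries
 * entirely covered by the header are skipped; the first surviving
 * entry is trimmed at its front.  Return the number of destination
 * entries written, or -1 if 'dst' is too small.
 */
static ssize_t clone_skip(struct iovec *dst, size_t dst_cnt,
			  const struct iovec *src, size_t src_cnt,
			  size_t hdrlen)
{
	size_t out = 0;

	for (size_t i = 0; i < src_cnt; i++) {
		char *base = src[i].iov_base;
		size_t len = src[i].iov_len;

		if (hdrlen >= len) {	/* entry holds only header bytes */
			hdrlen -= len;
			continue;
		}
		base += hdrlen;
		len -= hdrlen;
		hdrlen = 0;

		if (out == dst_cnt)
			return -1;	/* destination array too small */
		dst[out].iov_base = base;
		dst[out].iov_len = len;
		out++;
	}
	return (ssize_t)out;
}
```

The key difference from the old code is that the header reservation no longer mutates the guest-provided iovecs in place, so no "restore space for headers" pass is needed afterwards; the original entries stay intact for the header-building step.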
---
 tcp_vu.c | 161 ++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 107 insertions(+), 54 deletions(-)

diff --git a/tcp_vu.c b/tcp_vu.c
index a39f6ea018e9..96cd9da1caae 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -35,9 +35,24 @@
 #include "vu_common.h"
 #include
 
-static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + DISCARD_IOV_NUM];
+static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE];
 static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
-static int head[VIRTQUEUE_MAX_SIZE + 1];
+
+/**
+ * struct vu_frame - Descriptor for a TCP frame mapped to virtqueue elements
+ * @idx_element:	Index of first element in elem[] for this frame
+ * @num_element:	Number of virtqueue elements used by this frame
+ * @idx_iovec:		Index of first iovec in iov_vu[] for this frame
+ * @num_iovec:		Number of iovecs covering this frame's buffers
+ * @size:		Total frame size including all headers
"all headers" is a bit context dependent. I assume it includes the Ethernet header, but I'm not sure about virtio_net_hdr_mrg_rxbuf. If it doesn't, this should be called l2len. If it does... maybe we need to invent a term. Usually I wouldn't consider the "frame" to include something like mrg_rxbuf (I'd consider that a "hw" descriptor followed by the frame).
+ */
+static struct vu_frame {
+	int idx_element;
+	int num_element;
+	int idx_iovec;
+	int num_iovec;
+	size_t size;
+} frame[VIRTQUEUE_MAX_SIZE];
 
 /**
  * tcp_vu_hdrlen() - Sum size of all headers, from TCP to virtio-net
@@ -173,8 +188,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
  * @v6:			Set for IPv6 connections
  * @already_sent:	Number of bytes already sent
  * @fillsize:		Maximum bytes to fill in guest-side receiving window
- * @iov_cnt:		number of iov (output)
- * @head_cnt:		Pointer to store the count of head iov entries (output)
+ * @elem_used:		number of element (output)
+ * @frame_cnt:		Pointer to store the number of frames (output)
  *
  * Return: number of bytes received from the socket, or a negative error code
  *	   on failure.
@@ -182,58 +197,77 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
 				const struct tcp_tap_conn *conn, bool v6,
 				uint32_t already_sent, size_t fillsize,
-				int *iov_cnt, int *head_cnt)
+				int *elem_used, int *frame_cnt)
 {
+	static struct iovec iov_msg[VIRTQUEUE_MAX_SIZE + DISCARD_IOV_NUM];
What's the rationale for making this static?
 	const struct vu_dev *vdev = c->vdev;
 	struct msghdr mh_sock = { 0 };
 	uint16_t mss = MSS_GET(conn);
 	size_t hdrlen, iov_used;
 	int s = conn->sock;
+	ssize_t ret, dlen;
 	int elem_cnt;
-	ssize_t ret;
-	int i;
-
-	*iov_cnt = 0;
+	int i, j;
hdrlen = tcp_vu_hdrlen(v6);
+	*elem_used = 0;
+
 	iov_used = 0;
 	elem_cnt = 0;
-	*head_cnt = 0;
+	*frame_cnt = 0;
 
 	while (fillsize > 0 && elem_cnt < ARRAY_SIZE(elem) &&
-	       iov_used < VIRTQUEUE_MAX_SIZE) {
-		size_t frame_size, dlen, in_total;
-		struct iovec *iov;
+	       iov_used < ARRAY_SIZE(iov_vu) &&
+	       *frame_cnt < ARRAY_SIZE(frame)) {
+		size_t frame_size, in_total;
 		int cnt;
 
 		cnt = vu_collect(vdev, vq, &elem[elem_cnt],
 				 ARRAY_SIZE(elem) - elem_cnt,
-				 &iov_vu[DISCARD_IOV_NUM + iov_used],
-				 VIRTQUEUE_MAX_SIZE - iov_used, &in_total,
+				 &iov_vu[iov_used],
+				 ARRAY_SIZE(iov_vu) - iov_used, &in_total,
 				 MIN(mss, fillsize) + hdrlen, &frame_size);
 		if (cnt == 0)
 			break;
-		assert((size_t)cnt == in_total); /* one iovec per element */
+
+		frame[*frame_cnt].idx_element = elem_cnt;
+		frame[*frame_cnt].num_element = cnt;
+		frame[*frame_cnt].idx_iovec = iov_used;
+		frame[*frame_cnt].num_iovec = in_total;
+		frame[*frame_cnt].size = frame_size;
+		(*frame_cnt)++;
 
 		iov_used += in_total;
-		dlen = frame_size - hdrlen;
+		elem_cnt += cnt;
 
-		/* reserve space for headers in iov */
-		iov = &elem[elem_cnt].in_sg[0];
-		assert(iov->iov_len >= hdrlen);
+		fillsize -= frame_size - hdrlen;
+	}
 
-		iov->iov_base = (char *)iov->iov_base + hdrlen;
-		iov->iov_len -= hdrlen;
-		head[(*head_cnt)++] = elem_cnt;
+	/* build an iov array without headers */
+	for (i = 0, j = DISCARD_IOV_NUM; i < *frame_cnt &&
+	     j < ARRAY_SIZE(iov_msg); i++) {
+		struct iov_tail data;
+		ssize_t cnt;
 
-		fillsize -= dlen;
-		elem_cnt += cnt;
+		data = IOV_TAIL(&iov_vu[frame[i].idx_iovec],
+				frame[i].num_iovec, 0);
+		iov_drop_header(&data, hdrlen);
+
+		cnt = iov_tail_clone(&iov_msg[j], ARRAY_SIZE(iov_msg) - j,
+				     &data);
+		if (cnt == -1)
+			die("Missing entries in iov_msg");
IIUC that would indicate a bug in the sizing / setup of the arrays, in which case it should be an assert().
+
+		j += cnt;
 	}
 
-	if (tcp_prepare_iov(&mh_sock, iov_vu, already_sent, elem_cnt))
+	if (tcp_prepare_iov(&mh_sock, iov_msg, already_sent,
+			    j - DISCARD_IOV_NUM)) {
 		/* Expect caller to do a TCP reset */
+		vu_queue_rewind(vq, elem_cnt);
 		return -1;
+	}
 
 	do
 		ret = recvmsg(s, &mh_sock, MSG_PEEK);
@@ -247,28 +281,47 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
 	if (!peek_offset_cap)
 		ret -= already_sent;
 
-	i = vu_pad(&iov_vu[DISCARD_IOV_NUM], iov_used, hdrlen, ret);
+	dlen = ret;
 
-	/* adjust head count */
-	while (*head_cnt > 0 && head[*head_cnt - 1] >= i)
-		(*head_cnt)--;
+	/* truncate frame */
+	*elem_used = 0;
+	for (i = 0; i < *frame_cnt; i++) {
+		struct vu_frame *f = &frame[i];
 
-	/* mark end of array */
-	head[*head_cnt] = i;
-	*iov_cnt = i;
+		if ((size_t)ret <= f->size - hdrlen) {
+			unsigned cnt;
 
-	/* release unused buffers */
-	vu_queue_rewind(vq, elem_cnt - i);
+			cnt = vu_pad(&iov_vu[f->idx_iovec], f->num_iovec,
+				     0, hdrlen + ret);
 
-	/* restore space for headers in iov */
-	for (i = 0; i < *head_cnt; i++) {
-		struct iovec *iov = &elem[head[i]].in_sg[0];
+			f->size = ret + hdrlen;
+			f->num_iovec = cnt;
 
-		iov->iov_base = (char *)iov->iov_base - hdrlen;
-		iov->iov_len += hdrlen;
+			for (j = 0; j < f->num_element; j++) {
+				struct vu_virtq_element *e;
+
+				e = &elem[f->idx_element + j];
+				if (cnt <= e->in_num) {
+					e->in_num = cnt;
+					j++;
+					break;
+				}
+				cnt -= e->in_num;
+			}
+			f->num_element = j;
+			*elem_used += j;
+			i++;
+			break;
+		}
+		*elem_used += f->num_element;
+		ret -= f->size - hdrlen;
 	}
+	*frame_cnt = i;
 
-	return ret;
+	/* release unused buffers */
+	vu_queue_rewind(vq, elem_cnt - *elem_used);
+
+	return dlen;
 }
 
 /**
@@ -341,7 +394,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	struct vu_dev *vdev = c->vdev;
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
 	ssize_t len, previous_dlen;
-	int i, iov_cnt, head_cnt;
+	int i, elem_cnt, frame_cnt;
 	size_t hdrlen, fillsize;
 	int v6 = CONN_V6(conn);
 	uint32_t already_sent;
@@ -381,7 +434,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	 * data from the socket
 	 */
 	len = tcp_vu_sock_recv(c, vq, conn, v6, already_sent, fillsize,
-			       &iov_cnt, &head_cnt);
+			       &elem_cnt, &frame_cnt);
 	if (len < 0) {
 		if (len != -EAGAIN && len != -EWOULDBLOCK) {
 			tcp_rst(c, conn);
@@ -395,6 +448,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	}
 
 	if (!len) {
+		vu_queue_rewind(vq, elem_cnt);
 		if (already_sent) {
 			conn_flag(c, conn, STALLED);
 		} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
@@ -432,33 +486,32 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	 */
 
 	hdrlen = tcp_vu_hdrlen(v6);
 
-	for (i = 0, previous_dlen = -1, check = -1; i < head_cnt; i++) {
-		struct iovec *iov = &elem[head[i]].in_sg[0];
-		int buf_cnt = head[i + 1] - head[i];
-		size_t frame_size = iov_size(iov, buf_cnt);
-		bool push = i == head_cnt - 1;
+	for (i = 0, previous_dlen = -1, check = -1; i < frame_cnt; i++) {
+		struct iovec *iov = &iov_vu[frame[i].idx_iovec];
+		int iov_cnt = frame[i].num_iovec;
+		bool push = i == frame_cnt - 1;
 		ssize_t dlen;
 
-		assert(frame_size >= hdrlen);
+		assert(frame[i].size >= hdrlen);
 
-		dlen = frame_size - hdrlen;
-		vu_set_vnethdr(iov->iov_base, buf_cnt);
+		dlen = frame[i].size - hdrlen;
+		vu_set_vnethdr(iov->iov_base, frame[i].num_element);
 
 		/* The IPv4 header checksum varies only with dlen */
 		if (previous_dlen != dlen)
 			check = -1;
 		previous_dlen = dlen;
 
-		tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
+		tcp_vu_prepare(c, conn, iov, iov_cnt, &check, !*c->pcap, push);
 
 		if (*c->pcap)
-			pcap_iov(iov, buf_cnt, VNET_HLEN);
+			pcap_iov(iov, iov_cnt, VNET_HLEN);
 
 		conn->seq_to_tap += dlen;
 	}
 
 	/* send packets */
-	vu_flush(vdev, vq, elem, iov_cnt);
+	vu_flush(vdev, vq, elem, elem_cnt);
conn_flag(c, conn, ACK_FROM_TAP_DUE);
-- 2.53.0
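The truncation step above clamps each frame's iovecs to the bytes recvmsg() actually returned. Its core can be sketched as a small standalone helper that limits an iovec array to a byte budget, shrinking the last used entry. The name trunc_iov is illustrative; vu_pad() in the patch does roughly this, additionally accounting for a header offset.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Clamp an iovec array to 'limit' bytes: entries beyond the limit are
 * dropped, the last used entry is shortened in place.  Return how
 * many entries remain in use.
 */
static unsigned trunc_iov(struct iovec *iov, unsigned cnt, size_t limit)
{
	unsigned used = 0;

	for (unsigned i = 0; i < cnt && limit; i++) {
		if (iov[i].iov_len > limit)
			iov[i].iov_len = limit;	/* partial last entry */
		limit -= iov[i].iov_len;
		used++;
	}
	return used;
}
```

In the patch, the returned count then feeds back into per-frame bookkeeping (num_iovec, num_element, size) so that unused virtqueue elements can be handed back via vu_queue_rewind().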
Participants (2): David Gibson, Laurent Vivier