[PATCH v4 00/10] vhost-user: Preparatory series for multiple iovec entries per virtqueue element


Laurent Vivier

13 May 2026

1:52 p.m.

Currently, the vhost-user path assumes each virtqueue element contains exactly one iovec entry covering the entire frame. This assumption breaks as some virtio-net drivers (notably iPXE) provide descriptors where the vnet header and the frame payload are in separate buffers, resulting in two iovec entries per virtqueue element.

This series refactors the vhost-user data path so that frame lengths, header sizes, and padding are tracked and passed explicitly rather than being derived from iovec sizes. This decoupling is a prerequisite for correctly handling padding of multi-buffer frames.

The changes in this series can be split in 3 groups:

- New iov helpers (patches 1-2):

iov_memset() and iov_memcpy() operate across iovec boundaries. These are needed by the final patch to pad and copy frame data when a frame spans multiple iovec entries.

- Structural refactoring (patches 3-5):

Move vnethdr setup into vu_flush(), separate virtqueue management from socket I/O in the UDP path, and pass iov arrays explicitly instead of using file-scoped state. These changes make it possible to pass explicit frame lengths through the stack, which is required to pad frames independently of iovec layout.

- Explicit length passing throughout the stack (patches 6-10):

Thread explicit L4, L2, frame, and data lengths through checksum, pcap, vu_flush(), and tcp_fill_headers(), replacing lengths that were previously derived from iovec sizes. With lengths tracked explicitly, the final patch can centralise Ethernet frame padding into vu_collect() and a new vu_pad() helper that correctly pads frames spanning multiple iovec entries.
v4:
- rebase
- iov_memcpy: use size_t for loop indices i and j
- udp_vu: reorder elem[] declaration for inverted christmas tree style
- pcap: wrap pcap_iov() declaration and definition to respect line length
- write_remainder(): update length parameter description
- Add Reviewed-by tags from Jon and David

v3:
- csum_udp4()/csum_udp6()/udp_vu_csum receive payload length (dlen) rather than l4len
- Add a length parameter to write_remainder() and use it in pcap_frame()

v2:
- Rename iov_memcopy() to iov_memcpy() and use clearer parameter names
- Use clearer code in pcap_frame()
- Add braces around bodies in pcap.c and tcp_vu.c for style consistency
- Extract l2len variable in tap_add_packet() and tcp_vu_send_flag() to avoid repeating the same expression
- Fix indentation alignment of iov_skip_bytes() arguments in tcp_vu.c
- Introduce fill_size variable in vu_flush()
- Reposition comment for ETH_ZLEN in vu_collect()

Laurent Vivier (10):
  iov: Introduce iov_memset()
  iov: Add iov_memcpy() to copy data between iovec arrays
  vu_common: Move vnethdr setup into vu_flush()
  udp_vu: Move virtqueue management from udp_vu_sock_recv() to its caller
  udp_vu: Pass iov explicitly to helpers instead of using file-scoped array
  checksum: Pass explicit L4 length to checksum functions
  pcap: Pass explicit L2 length to pcap_iov()
  vu_common: Pass explicit frame length to vu_flush()
  tcp: Pass explicit data length to tcp_fill_headers()
  vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

 checksum.c     |  43 +++++++-----
 checksum.h     |   6 +-
 iov.c          |  77 ++++++++++++++++++++++
 iov.h          |   5 ++
 pcap.c         |  29 ++++++---
 pcap.h         |   3 +-
 tap.c          |  10 +--
 tcp.c          |  14 ++--
 tcp_buf.c      |   3 +-
 tcp_internal.h |   2 +-
 tcp_vu.c       |  66 ++++++++++---------
 udp.c          |   5 +-
 udp_vu.c       | 173 +++++++++++++++++++++++++------------------------
 util.c         |  31 +++++++--
 util.h         |   3 +-
 vu_common.c    |  58 ++++++++++-------
 vu_common.h    |   5 +-
 17 files changed, 339 insertions(+), 194 deletions(-)

-- 
2.54.0


Laurent Vivier

13 May

14 May

3:24 a.m.

New subject: [PATCH v4 10/10] vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

On Wed, May 13, 2026 at 01:52:18PM +0200, Laurent Vivier wrote:

...

The previous per-protocol padding done by vu_pad() in tcp_vu.c and udp_vu.c was only correct for single-buffer frames: it assumed the padding area always fell within the first iov, writing past its end with a plain memset().

It also required each caller to compute MAX(..., ETH_ZLEN + VNET_HLEN) for vu_collect() and to call vu_pad() at the right point, duplicating the minimum-size logic across protocols.

Move the Ethernet minimum size enforcement into vu_collect() itself, so that enough buffer space is always reserved for padding regardless of the requested frame size.

Rewrite vu_pad() to take a full iovec array and use iov_memset(), making it safe for multi-buffer (mergeable rx buffer) frames.
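As an illustration of the paragraph above, here is a minimal sketch of padding across iovec boundaries. The names sketch_iov_memset() and sketch_vu_pad() are hypothetical stand-ins, not the functions from the series, and the constants (60-byte minimum Ethernet frame, 12-byte virtio-net header) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Assumed values, for illustration only */
#define SKETCH_ETH_ZLEN		60	/* minimum Ethernet frame, no FCS */
#define SKETCH_VNET_HLEN	12	/* virtio-net header size */

/* Hypothetical stand-in for iov_memset(): set @len bytes to @c, starting
 * @offset bytes into the concatenation of the buffers in @iov. Bytes past
 * the end of the array are not written.
 */
void sketch_iov_memset(const struct iovec *iov, size_t cnt,
		       size_t offset, int c, size_t len)
{
	size_t i;

	assert(iov || !cnt);

	for (i = 0; i < cnt && len; i++) {
		size_t n;

		if (offset >= iov[i].iov_len) {
			/* Range starts after this entry: skip it whole */
			offset -= iov[i].iov_len;
			continue;
		}

		n = iov[i].iov_len - offset;
		if (n > len)
			n = len;

		memset((char *)iov[i].iov_base + offset, c, n);
		len -= n;
		offset = 0;	/* following entries are filled from the start */
	}
}

/* Hypothetical stand-in for the reworked vu_pad(): zero-fill from the end
 * of the frame up to the minimum size, wherever that falls in the array.
 */
void sketch_vu_pad(const struct iovec *iov, size_t cnt, size_t frame_len)
{
	size_t min_frame_len = SKETCH_ETH_ZLEN + SKETCH_VNET_HLEN;

	if (frame_len < min_frame_len)
		sketch_iov_memset(iov, cnt, frame_len, 0,
				  min_frame_len - frame_len);
}
```

Unlike a plain memset() relative to the first entry, this walks the array, so the padding lands in the right buffer even when the header and the payload sit in separate entries.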

In tcp_vu_sock_recv(), replace iov_truncate() with iov_skip_bytes(): now that all consumers receive explicit data lengths, truncating the iovecs is no longer needed. In tcp_vu_data_from_sock(), cap each frame's data length against the remaining bytes actually received from the socket, so that the last partial frame gets correct headers and sequence number advancement.
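The capping described above can be sketched as follows. This is only an illustration under assumptions: struct sketch_frame and sketch_split_frames() are hypothetical, and the actual loop in tcp_vu_data_from_sock() also deals with headers, iovecs and virtqueue elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-frame bookkeeping, not a structure from the series */
struct sketch_frame {
	uint32_t seq;	/* sequence number of the first payload byte */
	size_t dlen;	/* payload bytes carried by this frame */
};

/* Split @received bytes into frames of at most @mss payload bytes each,
 * starting from sequence number @seq. Returns the number of frames filled.
 */
size_t sketch_split_frames(uint32_t seq, size_t received, size_t mss,
			   struct sketch_frame *f, size_t max_frames)
{
	size_t n = 0;

	assert(mss > 0);

	while (received && n < max_frames) {
		/* Cap against what is actually left from the socket */
		size_t dlen = received < mss ? received : mss;

		f[n].seq = seq;
		f[n].dlen = dlen;

		seq += (uint32_t)dlen;	/* wraps modulo 2^32, as TCP does */
		received -= dlen;
		n++;
	}

	return n;
}
```

With 3000 bytes received and an MSS of 1460, the last frame carries 80 bytes, and the sequence number advances by 80 rather than by a full MSS.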

Signed-off-by: Laurent Vivier
Reviewed-by: Jon Maloy

Reviewed-by: David Gibson

But following on from my comments on v3, a couple of clarity nits for possible follow up:

[snip]

...

diff --git a/vu_common.c b/vu_common.c
index 704e908aa02c..d07f584f228a 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -74,6 +74,7 @@ int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq,
 	size_t current_iov = 0;
 	int elem_cnt = 0;

+	size = MAX(size, ETH_ZLEN /* Ethernet minimum size */ + VNET_HLEN);

Here I think "size" is a reasonable name, since it's the size of the buffer we're obtaining, i.e. a bound, but not otherwise related to the length of the frame.

...

 	while (current_size < size && elem_cnt < max_elem &&
 	       current_iov < max_in_sg) {
 		int ret;
@@ -261,29 +262,27 @@ int vu_send_single(const struct ctx *c, const void *buf, size_t size)
 		return -1;
 	}

-	size += VNET_HLEN;
 	elem_cnt = vu_collect(vdev, vq, elem, ARRAY_SIZE(elem), in_sg,
-			      ARRAY_SIZE(in_sg), &in_total, size, &total);
-	if (elem_cnt == 0 || total < size) {
+			      ARRAY_SIZE(in_sg), &in_total, VNET_HLEN + size, &total);
+	if (elem_cnt == 0 || total < VNET_HLEN + size) {

Here, "l2len" would be a much better name than "size".

...

 		debug("vu_send_single: no space to send the data "
 		      "elem_cnt %d size %zu", elem_cnt, total);
 		goto err;
 	}

-	total -= VNET_HLEN;
-
 	/* copy data from the buffer to the iovec */
-	iov_from_buf(in_sg, in_total, VNET_HLEN, buf, total);
+	iov_from_buf(in_sg, in_total, VNET_HLEN, buf, size);

 	if (*c->pcap)
 		pcap_iov(in_sg, in_total, VNET_HLEN, size);

+	vu_pad(in_sg, in_total, VNET_HLEN + size);
 	vu_flush(vdev, vq, elem, elem_cnt, VNET_HLEN + size);
 	vu_queue_notify(vdev, vq);

-	trace("vhost-user sent %zu", total);
+	trace("vhost-user sent %zu", size);

-	return total;
+	return size;
 err:
 	for (i = 0; i < elem_cnt; i++)
 		vu_queue_detach_element(vq);
@@ -292,15 +291,15 @@ err:
 }

 /**
- * vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed
- * @iov:	Buffer in iovec array where end of 802.3 frame is stored
- * @l2len:	Layer-2 length already filled in frame
+ * vu_pad() - Pad short frames to minimum Ethernet length and truncate iovec
+ * @iov:	Pointer to iovec array
+ * @cnt:	Number of entries in @iov
+ * @frame_len:	Data length in @iov (including virtio-net header)
  */
-void vu_pad(struct iovec *iov, size_t l2len)
+void vu_pad(const struct iovec *iov, size_t cnt, size_t frame_len)

Here we have the actual frame length, including device header, but not padding. "frame_len" is different from the other standard names we use, so it's not terrible, but "frame" often refers to the L2 object so it's not great either. Not sure if 'l1len' or 'l0len' would be getting too cutesy with what "physical" layer means in a virtual network. Something like "device_len" maybe? But that should probably include padding as well. Or alternatively, vu_pad() could be updated to take l2len, and add VNET_HLEN inside.

...

 {
-	if (l2len >= ETH_ZLEN)
-		return;
+	size_t min_frame_len = ETH_ZLEN + VNET_HLEN;

-	memset((char *)iov->iov_base + iov->iov_len, 0, ETH_ZLEN - l2len);
-	iov->iov_len += ETH_ZLEN - l2len;
+	if (frame_len < min_frame_len)
+		iov_memset(iov, cnt, frame_len, 0, min_frame_len - frame_len);
 }
diff --git a/vu_common.h b/vu_common.h
index 77d1849e6115..51f70084a7cb 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -44,6 +44,6 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);
-void vu_pad(struct iovec *iov, size_t l2len);
+void vu_pad(const struct iovec *iov, size_t cnt, size_t frame_len);

 #endif /* VU_COMMON_H */
-- 
2.54.0

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

Stefano Brivio

20 May

2:52 a.m.

On Wed, 13 May 2026 13:52:08 +0200 Laurent Vivier wrote:

...

Currently, the vhost-user path assumes each virtqueue element contains exactly one iovec entry covering the entire frame. This assumption breaks as some virtio-net drivers (notably iPXE) provide descriptors where the vnet header and the frame payload are in separate buffers, resulting in two iovec entries per virtqueue element.

This series refactors the vhost-user data path so that frame lengths, header sizes, and padding are tracked and passed explicitly rather than being derived from iovec sizes. This decoupling is a prerequisite for correctly handling padding of multi-buffer frames.

The changes in this series can be split in 3 groups:

- New iov helpers (patches 1-2):

iov_memset() and iov_memcpy() operate across iovec boundaries. These are needed by the final patch to pad and copy frame data when a frame spans multiple iovec entries.

- Structural refactoring (patches 3-5):

Move vnethdr setup into vu_flush(), separate virtqueue management from socket I/O in the UDP path, and pass iov arrays explicitly instead of using file-scoped state. These changes make it possible to pass explicit frame lengths through the stack, which is required to pad frames independently of iovec layout.

- Explicit length passing throughout the stack (patches 6-10):

Thread explicit L4, L2, frame, and data lengths through checksum, pcap, vu_flush(), and tcp_fill_headers(), replacing lengths that were previously derived from iovec sizes. With lengths tracked explicitly, the final patch can centralise Ethernet frame padding into vu_collect() and a new vu_pad() helper that correctly pads frames spanning multiple iovec entries.

v4:
- rebase
- iov_memcpy: use size_t for loop indices i and j
- udp_vu: reorder elem[] declaration for inverted christmas tree style
- pcap: wrap pcap_iov() declaration and definition to respect line length
- write_remainder(): update length parameter description
- Add Reviewed-by tags from Jon and David

Applied, sorry for the delay.

-- 
Stefano

Stefano Brivio

5:34 p.m.

On Wed, 13 May 2026 13:52:08 +0200 Laurent Vivier wrote:

...

Currently, the vhost-user path assumes each virtqueue element contains exactly one iovec entry covering the entire frame. This assumption breaks as some virtio-net drivers (notably iPXE) provide descriptors where the vnet header and the frame payload are in separate buffers, resulting in two iovec entries per virtqueue element.

This series refactors the vhost-user data path so that frame lengths, header sizes, and padding are tracked and passed explicitly rather than being derived from iovec sizes. This decoupling is a prerequisite for correctly handling padding of multi-buffer frames.

Sorry to bring (likely) bad news, but this series seems to introduce a regression: I got the migration/rampstream_in tests failing twice in a row, which I've never seen happening (I think I saw a single failure a long time ago when the machine had a high CPU load, but nothing else).

I'm currently bisecting and the bisect seems to point towards the end of the series (probably 10/10), but I haven't finished yet. I'll keep you posted. I haven't spotted anything that might cause issues there.

It's probably worth mentioning that after migration we send pretty small TCP frames (window probes), but I have no idea yet if that has anything to do with it.

-- 
Stefano

Stefano Brivio

6:07 p.m.

On Wed, 20 May 2026 17:34:45 +0200 Stefano Brivio wrote:

...

On Wed, 13 May 2026 13:52:08 +0200 Laurent Vivier wrote:

...
Currently, the vhost-user path assumes each virtqueue element contains exactly one iovec entry covering the entire frame. This assumption breaks as some virtio-net drivers (notably iPXE) provide descriptors where the vnet header and the frame payload are in separate buffers, resulting in two iovec entries per virtqueue element.

This series refactors the vhost-user data path so that frame lengths, header sizes, and padding are tracked and passed explicitly rather than being derived from iovec sizes. This decoupling is a prerequisite for correctly handling padding of multi-buffer frames.

Sorry to bring (likely) bad news, but this series seems to introduce a regression: I got the migration/rampstream_in tests failing twice in a row, which I've never seen happening (I think I saw a single failure a long time ago when the machine had a high CPU load, but nothing else).

I'm currently bisecting and the bisect seems to point towards the end of the series (probably 10/10), but I haven't finished yet. I'll keep you posted. I haven't spotted anything that might cause issues there.

Yeah, that's the one :(

$ git bisect bad
db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
commit db798fc60f4c5869cb53168354e068fb4dabd91a
Author: Laurent Vivier
Date: Wed May 13 13:52:18 2026 +0200

vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

The "TCP/IPv4: sequence check, ramps, inbound" test in rampstream_in gets stuck, once the source is done with the migration, and passt on the destination just printed:

Accepted TCP_REPAIR helper, PID 13 accepted connection from PID 16

I'll get captures and logs next. It seems to fail most of the time, I had two failures in a row again.

-- 
Stefano

Stefano Brivio

6:18 p.m.

On Wed, 20 May 2026 18:07:08 +0200 Stefano Brivio wrote:

...

On Wed, 20 May 2026 17:34:45 +0200 Stefano Brivio wrote:

...
On Wed, 13 May 2026 13:52:08 +0200 Laurent Vivier wrote:

...
Currently, the vhost-user path assumes each virtqueue element contains exactly one iovec entry covering the entire frame. This assumption breaks as some virtio-net drivers (notably iPXE) provide descriptors where the vnet header and the frame payload are in separate buffers, resulting in two iovec entries per virtqueue element.

This series refactors the vhost-user data path so that frame lengths, header sizes, and padding are tracked and passed explicitly rather than being derived from iovec sizes. This decoupling is a prerequisite for correctly handling padding of multi-buffer frames.

Sorry to bring (likely) bad news, but this series seems to introduce a regression: I got the migration/rampstream_in tests failing twice in a row, which I've never seen happening (I think I saw a single failure a long time ago when the machine had a high CPU load, but nothing else).

I'm currently bisecting and the bisect seems to point towards the end of the series (probably 10/10), but I haven't finished yet. I'll keep you posted. I haven't spotted anything that might cause issues there.

Yeah, that's the one :(

$ git bisect bad
db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
commit db798fc60f4c5869cb53168354e068fb4dabd91a
Author: Laurent Vivier
Date: Wed May 13 13:52:18 2026 +0200

vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

The "TCP/IPv4: sequence check, ramps, inbound" test in rampstream_in gets stuck, once the source is done with the migration, and passt on the destination just printed:

Accepted TCP_REPAIR helper, PID 13 accepted connection from PID 16

I'll get captures and logs next. It seems to fail most of the time, I had two failures in a row again.

Log from passt --debug attached. Likely highlight:

---
13.2853: ================ Vhost user message ================
13.2853: Request: VHOST_USER_SET_VRING_ADDR (9)
13.2853: Flags: 0x1
13.2853: Size: 40
13.2853: vhost_vring_addr:
13.2853: index: 0
13.2853: flags: 0
13.2853: desc_user_addr: 0x00007f0943f41000
13.2853: used_user_addr: 0x00007f0943f42240
13.2854: avail_user_addr: 0x00007f0943f42000
13.2854: log_guest_addr: 0x000000001ff43240
13.2854: Setting virtq addresses:
13.2854: vring_desc at 0x7f2e2e2ca000
13.2854: vring_used at 0x7f2e2e2cb240
13.2854: vring_avail at 0x7f2e2e2cb000
13.2854: Last avail index != used index: 2163 != 1936
13.2854: Got packet, but RX virtqueue not usable yet
---

pcap file of that passt instance empty, it didn't have a chance to send/receive packets yet.

-- 
Stefano

Stefano Brivio

10:53 p.m.

On Wed, 20 May 2026 18:18:52 +0200 Stefano Brivio wrote:

...

On Wed, 20 May 2026 18:07:08 +0200 Stefano Brivio wrote:

...
On Wed, 20 May 2026 17:34:45 +0200 Stefano Brivio wrote:

...
On Wed, 13 May 2026 13:52:08 +0200 Laurent Vivier wrote:

...
Currently, the vhost-user path assumes each virtqueue element contains exactly one iovec entry covering the entire frame. This assumption breaks as some virtio-net drivers (notably iPXE) provide descriptors where the vnet header and the frame payload are in separate buffers, resulting in two iovec entries per virtqueue element.

This series refactors the vhost-user data path so that frame lengths, header sizes, and padding are tracked and passed explicitly rather than being derived from iovec sizes. This decoupling is a prerequisite for correctly handling padding of multi-buffer frames.

Sorry to bring (likely) bad news, but this series seems to introduce a regression: I got the migration/rampstream_in tests failing twice in a row, which I've never seen happening (I think I saw a single failure a long time ago when the machine had a high CPU load, but nothing else).

I'm currently bisecting and the bisect seems to point towards the end of the series (probably 10/10), but I haven't finished yet. I'll keep you posted. I haven't spotted anything that might cause issues there.

Yeah, that's the one :(

$ git bisect bad
db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
commit db798fc60f4c5869cb53168354e068fb4dabd91a
Author: Laurent Vivier
Date: Wed May 13 13:52:18 2026 +0200

vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

The "TCP/IPv4: sequence check, ramps, inbound" test in rampstream_in gets stuck, once the source is done with the migration, and passt on the destination just printed:

Accepted TCP_REPAIR helper, PID 13 accepted connection from PID 16

I'll get captures and logs next. It seems to fail most of the time, I had two failures in a row again.

Log from passt --debug attached. Likely highlight:

---
13.2853: ================ Vhost user message ================
13.2853: Request: VHOST_USER_SET_VRING_ADDR (9)
13.2853: Flags: 0x1
13.2853: Size: 40
13.2853: vhost_vring_addr:
13.2853: index: 0
13.2853: flags: 0
13.2853: desc_user_addr: 0x00007f0943f41000
13.2853: used_user_addr: 0x00007f0943f42240
13.2854: avail_user_addr: 0x00007f0943f42000
13.2854: log_guest_addr: 0x000000001ff43240
13.2854: Setting virtq addresses:
13.2854: vring_desc at 0x7f2e2e2ca000
13.2854: vring_used at 0x7f2e2e2cb240
13.2854: vring_avail at 0x7f2e2e2cb000
13.2854: Last avail index != used index: 2163 != 1936
13.2854: Got packet, but RX virtqueue not usable yet
---

pcap file of that passt instance empty, it didn't have a chance to send/receive packets yet.

...but I bisected 10/10 itself, and realised that reverting the iov_truncate() -> iov_skip_bytes() conversion in tcp_vu_sock_recv() like this:

---
diff --git a/tcp_vu.c b/tcp_vu.c
index f6ac76e..ccc031e 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -249,11 +249,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
 	if (!peek_offset_cap)
 		ret -= already_sent;

-	i = iov_skip_bytes(&iov_vu[DISCARD_IOV_NUM], iov_used,
-			   MAX(hdrlen + ret, VNET_HLEN + ETH_ZLEN),
-			   NULL);
-	if ((size_t)i < iov_used)
-		i++;
+	i = iov_truncate(&iov_vu[DISCARD_IOV_NUM], iov_used, ret);

 	/* adjust head count */
 	while (*head_cnt > 0 && head[*head_cnt - 1] >= i)
---

hides / fixes the issue.

I'm testing things on a kernel without SO_PEEK_OFF support for TCP, but it doesn't seem to matter ('ret' at this point is the same before and after your patch).

I don't see what's wrong with your change though. It's not even about replacing 'ret' with the padded version, because I can also reproduce the issue with:

i = iov_skip_bytes(&iov_vu[DISCARD_IOV_NUM], iov_used, ret, NULL);

For convenience, this is how I'm selecting the test without bothering about variables in run():

---
diff --git a/test/run b/test/run
index f858e55..25d7002 100755
--- a/test/run
+++ b/test/run
@@ -71,6 +71,7 @@ run() {
 	perf_init
 	[ ${CI} -eq 1 ] && video_start ci

+dont() {
 	exeter smoke/smoke.sh
 	exeter build/build.py
 	exeter build/static_checkers.sh
@@ -162,6 +163,10 @@ run() {
 	setup migrate
 	test migrate/iperf3_many_out6
 	teardown migrate
+}
+	VHOST_USER=1
+	VALGRIND=0
+
 	setup migrate
 	test migrate/rampstream_in
 	teardown migrate
---

-- 
Stefano
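For readers comparing the two helpers named in the message above, here is a minimal sketch of how they differ, assuming the semantics their names and call sites suggest: a "skip" helper only locates a byte offset in the array, while a "truncate" helper also shortens the entry where the offset falls. sketch_skip_bytes() and sketch_truncate() are simplified, hypothetical stand-ins, not the passt implementations, and this is not a diagnosis of the failure discussed in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Stand-in for a "skip" helper: return the index of the entry containing
 * byte @skip of the concatenated buffers (or @n if the array is shorter),
 * and optionally the offset of that byte within the entry. Read-only.
 */
size_t sketch_skip_bytes(const struct iovec *iov, size_t n,
			 size_t skip, size_t *offset)
{
	size_t i;

	assert(iov || !n);

	for (i = 0; i < n; i++) {
		if (skip < iov[i].iov_len)
			break;
		skip -= iov[i].iov_len;
	}

	if (offset)
		*offset = skip;

	return i;
}

/* Stand-in for a "truncate" helper: shorten the entry where byte @size
 * falls, and return the number of entries that still hold data.
 */
size_t sketch_truncate(struct iovec *iov, size_t n, size_t size)
{
	size_t offset, i = sketch_skip_bytes(iov, n, size, &offset);

	if (i < n) {
		iov[i].iov_len = offset;	/* modifies the array */
		if (offset)
			i++;
	}

	return i;
}
```

In this sketch, the skip variant leaves every iov_len untouched, so any later consumer has to be given the data length explicitly, which is the approach the series takes.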

Laurent Vivier

21 May

10:30 a.m.

On 5/20/26 22:53, Stefano Brivio wrote:

...

On Wed, 20 May 2026 18:18:52 +0200 Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:07:08 +0200 Stefano Brivio wrote:

...
On Wed, 20 May 2026 17:34:45 +0200 Stefano Brivio wrote:

...
On Wed, 13 May 2026 13:52:08 +0200 Laurent Vivier wrote:

...
Currently, the vhost-user path assumes each virtqueue element contains exactly one iovec entry covering the entire frame. This assumption breaks as some virtio-net drivers (notably iPXE) provide descriptors where the vnet header and the frame payload are in separate buffers, resulting in two iovec entries per virtqueue element.

This series refactors the vhost-user data path so that frame lengths, header sizes, and padding are tracked and passed explicitly rather than being derived from iovec sizes. This decoupling is a prerequisite for correctly handling padding of multi-buffer frames.

Sorry to bring (likely) bad news, but this series seems to introduce a regression: I got the migration/rampstream_in tests failing twice in a row, which I've never seen happening (I think I saw a single failure a long time ago when the machine had a high CPU load, but nothing else).

I'm currently bisecting and the bisect seems to point towards the end of the series (probably 10/10), but I haven't finished yet. I'll keep you posted. I haven't spotted anything that might cause issues there.

Yeah, that's the one :(

$ git bisect bad
db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
commit db798fc60f4c5869cb53168354e068fb4dabd91a
Author: Laurent Vivier
Date: Wed May 13 13:52:18 2026 +0200

vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

I checked on my system with the commit previous to this series, bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not every time).

...

TCP/IPv4: sequence check, ramps, inbound ...failed.

and rampstream_out hangs sometimes too.

I'm going to try with earlier commits.

Thanks,
Laurent

...

...
...
The "TCP/IPv4: sequence check, ramps, inbound" test in rampstream_in gets stuck, once the source is done with the migration, and passt on the destination just printed:

Accepted TCP_REPAIR helper, PID 13 accepted connection from PID 16

I'll get captures and logs next. It seems to fail most of the time, I had two failures in a row again.

Log from passt --debug attached. Likely highlight:

---
13.2853: ================ Vhost user message ================
13.2853: Request: VHOST_USER_SET_VRING_ADDR (9)
13.2853: Flags: 0x1
13.2853: Size: 40
13.2853: vhost_vring_addr:
13.2853: index: 0
13.2853: flags: 0
13.2853: desc_user_addr: 0x00007f0943f41000
13.2853: used_user_addr: 0x00007f0943f42240
13.2854: avail_user_addr: 0x00007f0943f42000
13.2854: log_guest_addr: 0x000000001ff43240
13.2854: Setting virtq addresses:
13.2854: vring_desc at 0x7f2e2e2ca000
13.2854: vring_used at 0x7f2e2e2cb240
13.2854: vring_avail at 0x7f2e2e2cb000
13.2854: Last avail index != used index: 2163 != 1936
13.2854: Got packet, but RX virtqueue not usable yet
---

pcap file of that passt instance empty, it didn't have a chance to send/receive packets yet.

...but I bisected 10/10 itself, and realised that reverting the iov_truncate() -> iov_skip_bytes() conversion in tcp_vu_sock_recv() like this:

---
diff --git a/tcp_vu.c b/tcp_vu.c
index f6ac76e..ccc031e 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -249,11 +249,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
 	if (!peek_offset_cap)
 		ret -= already_sent;

-	i = iov_skip_bytes(&iov_vu[DISCARD_IOV_NUM], iov_used,
-			   MAX(hdrlen + ret, VNET_HLEN + ETH_ZLEN),
-			   NULL);
-	if ((size_t)i < iov_used)
-		i++;
+	i = iov_truncate(&iov_vu[DISCARD_IOV_NUM], iov_used, ret);

 	/* adjust head count */
 	while (*head_cnt > 0 && head[*head_cnt - 1] >= i)
---

hides / fixes the issue.

I'm testing things on a kernel without SO_PEEK_OFF support for TCP, but it doesn't seem to matter ('ret' at this point is the same before and after your patch).

I don't see what's wrong with your change though. It's not even about replacing 'ret' with the padded version, because I can also reproduce the issue with:

i = iov_skip_bytes(&iov_vu[DISCARD_IOV_NUM], iov_used, ret, NULL);

For convenience, this is how I'm selecting the test without bothering about variables in run():

---
diff --git a/test/run b/test/run
index f858e55..25d7002 100755
--- a/test/run
+++ b/test/run
@@ -71,6 +71,7 @@ run() {
 	perf_init
 	[ ${CI} -eq 1 ] && video_start ci

+dont() {
 	exeter smoke/smoke.sh
 	exeter build/build.py
 	exeter build/static_checkers.sh
@@ -162,6 +163,10 @@ run() {
 	setup migrate
 	test migrate/iperf3_many_out6
 	teardown migrate
+}
+	VHOST_USER=1
+	VALGRIND=0
+
 	setup migrate
 	test migrate/rampstream_in
 	teardown migrate
---

Laurent Vivier

22 May

1:13 a.m.

On 5/21/26 10:30, Laurent Vivier wrote:

...

On 5/20/26 22:53, Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:18:52 +0200 Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:07:08 +0200 Stefano Brivio wrote:

...
On Wed, 20 May 2026 17:34:45 +0200 Stefano Brivio wrote:

...
On Wed, 13 May 2026 13:52:08 +0200 Laurent Vivier wrote:

...
Currently, the vhost-user path assumes each virtqueue element contains exactly one iovec entry covering the entire frame. This assumption breaks as some virtio-net drivers (notably iPXE) provide descriptors where the vnet header and the frame payload are in separate buffers, resulting in two iovec entries per virtqueue element.

This series refactors the vhost-user data path so that frame lengths, header sizes, and padding are tracked and passed explicitly rather than being derived from iovec sizes. This decoupling is a prerequisite for correctly handling padding of multi-buffer frames.

Sorry to bring (likely) bad news, but this series seems to introduce a regression: I got the migration/rampstream_in tests failing twice in a row, which I've never seen happening (I think I saw a single failure a long time ago when the machine had a high CPU load, but nothing else).

I'm currently bisecting and the bisect seems to point towards the end of the series (probably 10/10), but I haven't finished yet. I'll keep you posted. I haven't spotted anything that might cause issues there.

Yeah, that's the one :(

$ git bisect bad
db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
commit db798fc60f4c5869cb53168354e068fb4dabd91a
Author: Laurent Vivier
Date: Wed May 13 13:52:18 2026 +0200

vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

I checked on my system with the commit previous to this series, bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not every time).

...
TCP/IPv4: sequence check, ramps, inbound ...failed.

and rampstream_out hangs sometimes too.

I'm going to try with earlier commits.

For me the problem can happen with any commit...

As it depends on the execution path and on the load and speed of the system it looks like a race condition.

Did you try to test on a host with a kernel patched with "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?

Thanks,
Laurent

Stefano Brivio

6:22 a.m.

On Fri, 22 May 2026 01:13:33 +0200 Laurent Vivier wrote:

...

On 5/21/26 10:30, Laurent Vivier wrote:

...
On 5/20/26 22:53, Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:18:52 +0200 Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:07:08 +0200 Stefano Brivio wrote:

...
On Wed, 20 May 2026 17:34:45 +0200 Stefano Brivio wrote:

...
On Wed, 13 May 2026 13:52:08 +0200 Laurent Vivier wrote:
> Currently, the vhost-user path assumes each virtqueue element contains
> exactly one iovec entry covering the entire frame. This assumption
> breaks as some virtio-net drivers (notably iPXE) provide descriptors where the
> vnet header and the frame payload are in separate buffers, resulting in
> two iovec entries per virtqueue element.
>
> This series refactors the vhost-user data path so that frame lengths,
> header sizes, and padding are tracked and passed explicitly rather than
> being derived from iovec sizes. This decoupling is a prerequisite for
> correctly handling padding of multi-buffer frames.

Sorry to bring (likely) bad news, but this series seems to introduce a regression: I got the migration/rampstream_in tests failing twice in a row, which I've never seen happening (I think I saw a single failure a long time ago when the machine had a high CPU load, but nothing else).

I'm currently bisecting and the bisect seems to point towards the end of the series (probably 10/10), but I haven't finished yet. I'll keep you posted. I haven't spotted anything that might cause issues there.

Yeah, that's the one :(

$ git bisect bad
db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
commit db798fc60f4c5869cb53168354e068fb4dabd91a
Author: Laurent Vivier
Date: Wed May 13 13:52:18 2026 +0200

vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

I checked on my system with the commit previous to this series, bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not every time).

...
TCP/IPv4: sequence check, ramps, inbound ...failed.

and rampstream_out hangs sometimes too.

I'm going to try with earlier commits.

For me the problem can happen with any commit...

As it depends on the execution path and on the load and speed of the system it looks like a race condition.

Hah, thanks for checking. Maybe...

...

Did you try to test on a host with a kernel patched with "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?

Now I tried, and yes, the test doesn't hang anymore!

I seem to have an issue with teardown functions on recent kernels (current net.git HEAD more or less):

---
+ teardown_migrate
+ cat /tmp/passt-tests-VVtLn0/migrate/qemu_1.pid
+ /home/sbrivio/passt/test/nstool exec /tmp/passt-tests-VVtLn0/migrate/ns1.hold -- kill 16
qemu-system-x86_64: terminating on signal 15 from pid 34 ()
+ cat /tmp/passt-tests-VVtLn0/migrate/qemu_2.pid
+ /home/sbrivio/passt/test/nstool exec /tmp/passt-tests-VVtLn0/migrate/ns1.hold -- kill 15
18.8974: ================ Vhost user message ================
18.8974: Request: VHOST_USER_GET_VRING_BASE (11)
18.8974: Flags: 0x1
18.8974: Size: 8
18.8974: State.index: 0
18.8975: ================ Vhost user message ================
18.8975: Request: VHOST_USER_GET_VRING_BASE (11)
18.8975: Flags: 0x1
18.8975: Size: 8
18.8975: State.index: 1
qemu-system-x86_64: terminating on signal 15 from pid 35 ()
18.7961: Client connection closed
18.7962: Closing TCP_REPAIR helper socket
+ context_wait qemu_1
+ __name=qemu_1
+ __pidfile=/tmp/passt-tests-VVtLn0/migrate/context_qemu_1.pid
+ cat /tmp/passt-tests-VVtLn0/migrate/context_qemu_1.pid
+ rc=0
+ rm /tmp/passt-tests-VVtLn0/migrate/context_passt_repair_2.stdout.9pwpVbQr /tmp/passt-tests-VVtLn0/migrate/context_passt_repair_2.stderr.dSY5hBu1
+ __pid=67766
+ rm /tmp/passt-tests-VVtLn0/migrate/context_qemu_1.pid
+ [ 1 -eq 1 ]
+ echo [Exit code: 0]
+ echo -n passt_repair_2$
+ return 0
18.9016: Client connection closed
18.9018: Closing TCP_REPAIR helper socket
+ wait 67766
+ rc=0
+ rm /tmp/passt-tests-VVtLn0/migrate/context_passt_repair_1.stdout.JEyDGxXe /tmp/passt-tests-VVtLn0/migrate/context_passt_repair_1.stderr.WU550iEI
+ [ 1 -eq 1 ]
+ echo [Exit code: 0]
+ echo -n passt_repair_1$
+ return 0
+ rc=0
+ rm /tmp/passt-tests-VVtLn0/migrate/context_qemu_2.stdout.Dm8EAhfl /tmp/passt-tests-VVtLn0/migrate/context_qemu_2.stderr.207qJYPA
+ [ 1 -eq 1 ]
+ echo [Exit code: 0]
+ echo -n qemu_2$
+ return 0
2026/05/22 04:08:23 socat[73089] E connect(5, AF=40 cid:94558 port:22, 16): Connection timed out
Connection closed by UNKNOWN port 65535
...
---

it looks like we stop QEMU a bit too early. But it should be unrelated.

I'm now trying to find some kind of workaround for existing (not fixed) kernel versions. Maybe stopping rampstream_in for a moment or something like that.

-- 
Stefano

Stefano Brivio

7:44 a.m.

On Fri, 22 May 2026 06:22:39 +0200 Stefano Brivio wrote:

...

On Fri, 22 May 2026 01:13:33 +0200 Laurent Vivier wrote:

...
On 5/21/26 10:30, Laurent Vivier wrote:

...
On 5/20/26 22:53, Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:18:52 +0200 Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:07:08 +0200 Stefano Brivio wrote:

...
On Wed, 20 May 2026 17:34:45 +0200 Stefano Brivio wrote:

> On Wed, 13 May 2026 13:52:08 +0200
> Laurent Vivier wrote:
>> Currently, the vhost-user path assumes each virtqueue element contains
>> exactly one iovec entry covering the entire frame. This assumption
>> breaks as some virtio-net drivers (notably iPXE) provide descriptors where the
>> vnet header and the frame payload are in separate buffers, resulting in
>> two iovec entries per virtqueue element.
>>
>> This series refactors the vhost-user data path so that frame lengths,
>> header sizes, and padding are tracked and passed explicitly rather than
>> being derived from iovec sizes. This decoupling is a prerequisite for
>> correctly handling padding of multi-buffer frames.
>
> Sorry to bring (likely) bad news, but this series seems to introduce a
> regression: I got the migration/rampstream_in tests fail twice in a
> row, which I've never saw happening (I think I saw a single failure a
> long time ago when the machine had a high CPU load, but nothing else).
>
> I'm currently bisecting and the bisect seems to point towards the end
> of the series (probably 10/10), but I haven't finished yet. I'll keep
> you posted. I haven't spotted anything that might cause issues there.

Yeah, that's the one :(

$ git bisect bad
db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
commit db798fc60f4c5869cb53168354e068fb4dabd91a
Author: Laurent Vivier
Date:   Wed May 13 13:52:18 2026 +0200

    vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

I checked on my system with the commit previous to this series, bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not every time).

...
TCP/IPv4: sequence check, ramps, inbound ...failed.

and rampstream_out hangs sometimes too.

I'm going to try with earlier commits.

For me the problem can happen with any commit...

As it depends on the execution path and on the load and speed of the system it looks like a race condition.

Hah, thanks for checking. Maybe...

...
Did you try to test on a host with a kernel patched with "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?

Now I tried, and yes, the test doesn't hang anymore! I seem to have an issue with teardown functions on recent kernels (current net.git HEAD more or less):

...

it looks like we stop QEMU a bit too early. But it should be unrelated.

I'm now trying to find some kind of workaround for existing (not fixed) kernel versions. Maybe stopping rampstream_in for a moment or something like that.

For some weird reason even very blatant throttling (100 ms - 1 s delays every 10000 ramps, or an explicit 500 ms pause via signal before migration) doesn't help.

So it doesn't seem to be *that* kind of race. I should probably check the same exact kernel version with fix and without...

-- Stefano
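For readers following along, the two throttling attempts mentioned above (a delay every 10000 ramps, and a 500 ms pause requested by a signal) could look roughly like the sketch below. This is not the actual rampstream code, which isn't shown in this thread: the `send` callback, the `send_ramps()` and `request_pause()` names, and the choice of SIGUSR1 are all assumptions made for illustration.

```python
import signal
import time

RAMPS_PER_BURST = 10000  # from the thread: a delay every 10000 ramps
BURST_DELAY = 0.1        # 100 ms, low end of the 100 ms - 1 s range tried
SIGNAL_PAUSE = 0.5       # 500 ms pause requested before migration

pause_requested = False


def request_pause(signum, frame):
    """Signal handler: ask the sender to pause before its next ramp."""
    global pause_requested
    pause_requested = True


def install_handler():
    """Hook the pause request to SIGUSR1 (assumed signal, main thread only)."""
    signal.signal(signal.SIGUSR1, request_pause)


def send_ramps(count, send, sleep=time.sleep):
    """Send `count` ramps via the hypothetical send(i) callback.

    Returns the list of delays that were inserted, in order.
    """
    global pause_requested
    delays = []
    for i in range(count):
        if pause_requested:
            # One-shot pause, as if triggered just before migration
            pause_requested = False
            sleep(SIGNAL_PAUSE)
            delays.append(SIGNAL_PAUSE)
        send(i)
        if (i + 1) % RAMPS_PER_BURST == 0:
            # Periodic throttling after every full burst
            sleep(BURST_DELAY)
            delays.append(BURST_DELAY)
    return delays
```

As noted in the message above, neither form of throttling made the failure go away, so the sketch only documents what was tried.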

David Gibson

8:15 a.m.

On Fri, May 22, 2026 at 07:44:56AM +0200, Stefano Brivio wrote:

...

On Fri, 22 May 2026 06:22:39 +0200 Stefano Brivio wrote:

...
On Fri, 22 May 2026 01:13:33 +0200 Laurent Vivier wrote:

...
On 5/21/26 10:30, Laurent Vivier wrote:

...
On 5/20/26 22:53, Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:18:52 +0200 Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:07:08 +0200 Stefano Brivio wrote:

...

I checked on my system with the commit previous to this series, bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not every time).

...
TCP/IPv4: sequence check, ramps, inbound ...failed.

and rampstream_out hangs sometimes too.

I'm going to try with earlier commits.

For me the problem can happen with any commit...

As it depends on the execution path and on the load and speed of the system it looks like a race condition.

Hah, thanks for checking. Maybe...

...
Did you try to test on a host with a kernel patched with "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?

Now I tried, and yes, the test doesn't hang anymore! I seem to have an issue with teardown functions on recent kernels (current net.git HEAD more or less):

...

it looks like we stop QEMU a bit too early. But it should be unrelated.

I'm now trying to find some kind of workaround for existing (not fixed) kernel versions. Maybe stopping rampstream_in for a moment or something like that.

For some weird reason even very blatant throttling (100 ms - 1 s delays every 10000 ramps, or an explicit 500 ms pause via signal before migration) doesn't help.

So it doesn't seem to be *that* kind of race. I should probably check the same exact kernel version with fix and without...

If it's due to the kernel not stopping the queues on REPAIR, then the only real way to fix the test is to cut off the source machine's network before we trigger migration. That could be done with netfilter (in a user+netns). But probably more natural would be to not do the migration between local passt instances, but actually between two host namespaces, with separate netifs for external connectivity and for the migration. Remove the external netif on the source, then trigger migration, then add the external netif on the destination.

It's quite a bit of hassle :(. But it does model something much closer to a real migration scenario. As a bonus it would mean we'd no longer rely on the hack of guessing when to exit the source passt in order to allow the destination passt to bind.

--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
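The ordering proposed above (cut the source's external connectivity, trigger the migration, restore connectivity on the destination) can be written down as a small dry-run sketch. Everything concrete here is an assumption for illustration: the namespace names `src` and `dst`, the interface name `ext0`, and the `./trigger-migration` placeholder do not come from the passt test suite, and the sketch sets the link down and up with iproute2 instead of removing and re-adding the interface.

```python
import subprocess


def migration_steps(src_ns, dst_ns, ext_if, trigger_cmd):
    """Return the commands, in order, for a migration with the source cut off."""
    return [
        # 1. Cut the source off from external traffic before migrating
        ["ip", "-n", src_ns, "link", "set", "dev", ext_if, "down"],
        # 2. Trigger the migration (placeholder command from the caller)
        list(trigger_cmd),
        # 3. Restore external connectivity on the destination only
        ["ip", "-n", dst_ns, "link", "set", "dev", ext_if, "up"],
    ]


def run(steps, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for argv in steps:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


# Hypothetical names, dry run only: nothing is executed here
steps = migration_steps("src", "dst", "ext0", ["./trigger-migration"])
run(steps)
```

Running the commands for real would need named network namespaces and sufficient privileges. The point of the sketch is only the order of the three steps, which means no traffic can reach the source while its TCP state is being dumped.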

Stefano Brivio

8:23 a.m.

On Fri, 22 May 2026 16:15:08 +1000 David Gibson wrote:

...

On Fri, May 22, 2026 at 07:44:56AM +0200, Stefano Brivio wrote:

...
On Fri, 22 May 2026 06:22:39 +0200 Stefano Brivio wrote:

...
On Fri, 22 May 2026 01:13:33 +0200 Laurent Vivier wrote:

...
On 5/21/26 10:30, Laurent Vivier wrote:

...
On 5/20/26 22:53, Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:18:52 +0200 Stefano Brivio wrote:

...

I checked on my system with the commit previous to this series, bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not every time).

...
TCP/IPv4: sequence check, ramps, inbound ...failed.

and rampstream_out hangs sometimes too.

I'm going to try with earlier commits.

For me the problem can happen with any commit...

As it depends on the execution path and on the load and speed of the system it looks like a race condition.

Hah, thanks for checking. Maybe...

...
Did you try to test on a host with a kernel patched with "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?

Now I tried, and yes, the test doesn't hang anymore! I seem to have an issue with teardown functions on recent kernels (current net.git HEAD more or less):

...

it looks like we stop QEMU a bit too early. But it should be unrelated.

I'm now trying to find some kind of workaround for existing (not fixed) kernel versions. Maybe stopping rampstream_in for a moment or something like that.

For some weird reason even very blatant throttling (100 ms - 1 s delays every 10000 ramps, or an explicit 500 ms pause via signal before migration) doesn't help.

So it doesn't seem to be *that* kind of race. I should probably check the same exact kernel version with fix and without...

If it's due to the kernel not stopping the queues on REPAIR, then the only real way to fix the test is to cut off the source machine's network before we trigger migration.

Well, that's a rather complicated way to do it. One could simply stop the traffic instead. But it doesn't help, so there's probably another issue.

...

That could be done with netfilter (in a user+netns). But probably more natural would be to not do the migration between local passt instances, but actually between two host namespaces, with separate netifs for external connectivity and for the migration. Remove the external netif on the source, then trigger migration, then add the external netif on the destination.

It's quite a bit of hassle :(. But it does model something much closer to a real migration scenario. As a bonus it would mean we'd no longer rely on the hack of guessing when to exit the source passt in order to allow the destination passt to bind.

I struggle to see how that would be worth the investment, especially if we're working around a kernel issue that should eventually be fixed.

Or, at least, right now, I'm just trying to get tests to pass while keeping Laurent's changes in the tree...

-- Stefano

David Gibson

8:36 a.m.

On Fri, May 22, 2026 at 08:23:50AM +0200, Stefano Brivio wrote:

...

On Fri, 22 May 2026 16:15:08 +1000 David Gibson wrote:

...
On Fri, May 22, 2026 at 07:44:56AM +0200, Stefano Brivio wrote:

...
On Fri, 22 May 2026 06:22:39 +0200 Stefano Brivio wrote:

...
On Fri, 22 May 2026 01:13:33 +0200 Laurent Vivier wrote:

...
On 5/21/26 10:30, Laurent Vivier wrote:

...
On 5/20/26 22:53, Stefano Brivio wrote:

...

I checked on my system with the commit previous to this series, bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not every time).

> TCP/IPv4: sequence check, ramps, inbound ...failed.

and rampstream_out hangs sometimes too.

I'm going to try with earlier commits.

For me the problem can happen with any commit...

As it depends on the execution path and on the load and speed of the system it looks like a race condition.

Hah, thanks for checking. Maybe...

...
Did you try to test on a host with a kernel patched with "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?

Now I tried, and yes, the test doesn't hang anymore! I seem to have an issue with teardown functions on recent kernels (current net.git HEAD more or less):

...

it looks like we stop QEMU a bit too early. But it should be unrelated.

I'm now trying to find some kind of workaround for existing (not fixed) kernel versions. Maybe stopping rampstream_in for a moment or something like that.

For some weird reason even very blatant throttling (100 ms - 1 s delays every 10000 ramps, or an explicit 500 ms pause via signal before migration) doesn't help.

So it doesn't seem to be *that* kind of race. I should probably check the same exact kernel version with fix and without...

If it's due to the kernel not stopping the queues on REPAIR, then the only real way to fix the test is to cut off the source machine's network before we trigger migration.

Well, that's a rather complicated way to do it. One could simply stop the traffic instead.

I don't know that "simply" is quite so simple. You can suspend the source of the data, but you need to wait a difficult-to-ascertain amount of time for that to make it to the guest, and all the acks to come back. For rampstream_out it's worse: the source is in the guest, which isn't supposed to know about the migration in advance, so you can't really stop it without stopping the guest's whole network.

...

But it doesn't help, so there's probably another issue.

...
That could be done with netfilter (in a user+netns). But probably more natural would be to not do the migration between local passt instances, but actually between two host namespaces, with separate netifs for external connectivity and for the migration. Remove the external netif on the source, then trigger migration, then add the external netif on the destination.

It's quite a bit of hassle :(. But it does model something much closer to a real migration scenario. As a bonus it would mean we'd no longer rely on the hack of guessing when to exit the source passt in order to allow the destination passt to bind.

I struggle to see how that would be worth the investment, especially if we're working around a kernel issue that should eventually be fixed.

Or, at least, right now, I'm just trying to get tests to pass while keeping Laurent's changes in the tree...

-- Stefano

--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson

Stefano Brivio

8:45 a.m.

On Fri, 22 May 2026 16:36:52 +1000 David Gibson wrote:

...

On Fri, May 22, 2026 at 08:23:50AM +0200, Stefano Brivio wrote:

...
On Fri, 22 May 2026 16:15:08 +1000 David Gibson wrote:

...
On Fri, May 22, 2026 at 07:44:56AM +0200, Stefano Brivio wrote:

...
On Fri, 22 May 2026 06:22:39 +0200 Stefano Brivio wrote:

...
On Fri, 22 May 2026 01:13:33 +0200 Laurent Vivier wrote:

...
On 5/21/26 10:30, Laurent Vivier wrote:

...

For me the problem can happen with any commit...

As it depends on the execution path and on the load and speed of the system it looks like a race condition.

Hah, thanks for checking. Maybe...

...
Did you try to test on a host with a kernel patched with "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?

Now I tried, and yes, the test doesn't hang anymore! I seem to have an issue with teardown functions on recent kernels (current net.git HEAD more or less):

...

it looks like we stop QEMU a bit too early. But it should be unrelated.

I'm now trying to find some kind of workaround for existing (not fixed) kernel versions. Maybe stopping rampstream_in for a moment or something like that.

For some weird reason even very blatant throttling (100 ms - 1 s delays every 10000 ramps, or an explicit 500 ms pause via signal before migration) doesn't help.

So it doesn't seem to be *that* kind of race. I should probably check the same exact kernel version with fix and without...

If it's due to the kernel not stopping the queues on REPAIR, then the only real way to fix the test is to cut off the source machine's network before we trigger migration.

Well, that's a rather complicated way to do it. One could simply stop the traffic instead.

I don't know that "simply" is quite so simple. You can suspend the source of the data, but you need to wait a difficult-to-ascertain amount of time for that to make it to the guest, and all the acks to come back.

Looking at captures, that part seems to take around 1-2 ms, so I'm waiting 100 ms.

...

For rampstream_out it's worse: the source is in the guest which isn't supposed to know about the migration in advance, so you can't really stop it without stopping the guest's whole network.

But we don't have a problem with that one.

-- Stefano

Stefano Brivio

2:04 p.m.

On Fri, 22 May 2026 07:44:55 +0200 Stefano Brivio wrote:

...

On Fri, 22 May 2026 06:22:39 +0200 Stefano Brivio wrote:

...
On Fri, 22 May 2026 01:13:33 +0200 Laurent Vivier wrote:

...
On 5/21/26 10:30, Laurent Vivier wrote:

...
On 5/20/26 22:53, Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:18:52 +0200 Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:07:08 +0200 Stefano Brivio wrote:

> On Wed, 20 May 2026 17:34:45 +0200
> Stefano Brivio wrote:
>> On Wed, 13 May 2026 13:52:08 +0200
>> Laurent Vivier wrote:
>>> Currently, the vhost-user path assumes each virtqueue element contains
>>> exactly one iovec entry covering the entire frame. This assumption
>>> breaks as some virtio-net drivers (notably iPXE) provide descriptors where the
>>> vnet header and the frame payload are in separate buffers, resulting in
>>> two iovec entries per virtqueue element.
>>>
>>> This series refactors the vhost-user data path so that frame lengths,
>>> header sizes, and padding are tracked and passed explicitly rather than
>>> being derived from iovec sizes. This decoupling is a prerequisite for
>>> correctly handling padding of multi-buffer frames.
>>
>> Sorry to bring (likely) bad news, but this series seems to introduce a
>> regression: I got the migration/rampstream_in tests fail twice in a
>> row, which I've never seen happening (I think I saw a single failure a
>> long time ago when the machine had a high CPU load, but nothing else).
>>
>> I'm currently bisecting and the bisect seems to point towards the end
>> of the series (probably 10/10), but I haven't finished yet. I'll keep
>> you posted. I haven't spotted anything that might cause issues there.
>
> Yeah, that's the one :(
>
> $ git bisect bad
> db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
> commit db798fc60f4c5869cb53168354e068fb4dabd91a
> Author: Laurent Vivier
> Date: Wed May 13 13:52:18 2026 +0200
>
> vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

I checked on my system with the commit previous to this series, bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not every time).

...
TCP/IPv4: sequence check, ramps, inbound ...failed.

and rampstream_out hangs sometimes too.

I'm going to try with earlier commits.

For me the problem can happen with any commit...

As it depends on the execution path and on the load and speed of the system it looks like a race condition.

Hah, thanks for checking. Maybe...

...
Did you try to test on a host with a kernel patched with "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?

Now I tried, and yes, the test doesn't hang anymore! I seem to have an issue with teardown functions on recent kernels (current net.git HEAD more or less):

--- [...]

2026/05/22 04:08:23 socat[73089] E connect(5, AF=40 cid:94558 port:22, 16): Connection timed out Connection closed by UNKNOWN port 65535 ... ---

it looks like we stop QEMU a bit too early. But it should be unrelated.

Oops, I forgot to upgrade QEMU on the virtual machine I was using to test those kernel builds, I had a somewhat outdated 8.1 version and it failed migration for unrelated reasons. It works with 11.0.

Back to kernel versions: the "problem" is that with a recent net-next.git HEAD, with or without my fix, in a nested VM, the test always passes (20/20). And I can't easily test things non-nested.

I guess I could just skip that test for the moment from the set I run on git push, and run it manually in the virtual machine.

But judging from captures (test_logs/pasta_1.pcap from PCAP=1 ./run) I'm fairly sure it's not *that* issue:

465 12.141763 192.0.2.1 → 88.198.0.164 58451 TCP [TCP Window Full] 34416 → 10001 [PSH, ACK] Seq=10002100 Ack=1 Win=65536 Len=58397
466 12.187195 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
467 13.187281 192.0.2.1 → 88.198.0.164 4150 TCP 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096

last data transfer from client (rampstream):

468 13.187358 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0

everything acknowledged, migration starts now:

469 14.143217 fe80::f471:c3ff:fe10:4e45 → ff02::2 70 ICMPv6 Router Solicitation from f6:71:c3:10:4e:45
470 14.687123 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] [TCP Keep-Alive] 10001 → 34416 [ACK] Seq=0 Ack=10060497 Win=0 Len=0

migration completed: and we acknowledge the right sequence (10060497), so it didn't jump forward.

But starting from this point:

471 14.687265 192.0.2.1 → 88.198.0.164 60 TCP 34416 → 10001 [ACK] Seq=10060497 Ack=1 Win=65536 Len=0
472 16.687412 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
473 16.687450 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
474 20.687650 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
475 20.687692 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
476 28.687817 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096

we keep advertising a zero window (that's the kernel doing it really), as if we were unable to dequeue data.

I enabled --trace just for the target instance of passt, and I don't see anything suspicious there:

13.0958: Receiving 1 flows
13.0958: Flow 0 (NEW): FREE -> NEW
13.0958: Flow 0 (TCP connection): TGT -> TYPED
13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001
13.0958: Flow 0 (TCP connection): Side 1 hash table insert: bucket: 138154
13.0958: Flow 0 (TCP connection): TYPED -> ACTIVE
13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001
13.0959: Flow 0 (TCP connection): Extended migration data, socket 83 sequences send 3121929544 receive 1643895001
13.0959: Flow 0 (TCP connection): pending queues: send 0 not sent 0 receive 3500081
13.0959: Flow 0 (TCP connection): window: snd_wl1 1647395082 snd_wnd 65536 max 65536 rcv_wnd 0 rcv_wup 1647395082
13.0959: Flow 0 (TCP connection): SO_PEEK_OFF disabled offset=0
13.0985: Got packet, but RX virtqueue not usable yet
13.0985: Closing migration channel, fd: 82
13.0985: Closing TCP_REPAIR helper socket
13.0985: passt: epoll event on vhost-user command socket 77 (events: 0x00000001)

then the usual VHOST_USER_CHECK_DEVICE_STATE and VHOST_USER_SET_VRING_ENABLE commands. After that, a tight loop of:

13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.0986: Got packet, but RX virtqueue not usable yet
13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.0986: Got packet, but RX virtqueue not usable yet

until we go further with the vhost-user setup. I still see this message which I had never noticed (but I didn't try to bisect around it):

13.1006: ================ Vhost user message ================
13.1006: Request: VHOST_USER_SET_VRING_ADDR (9)
[...]
13.1006: Last avail index != used index: 3252 != 3027

and then after VHOST_USER_SET_VRING_CALL, and:

13.1008: passt: epoll event on vhost-user kick socket 78 (events: 0x00000001)
13.1008: vhost-user: got kick_data: 0000000000000001 idx: 1

it's just a tight loop of:

13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)

as if we weren't dequeueing anything from there.

I start suspecting we might be hitting two different issues: perhaps things fail on your setup because of the kernel bug with TCP_REPAIR not freezing the queue, and they fail on my setup for some other reason.

For me it's very deterministic though: with patch 10/10 things always fail, and without it they never fail.

I guess I'll add more prints and check for more messages before/after that patch.

-- Stefano

Laurent Vivier

26 May

9:31 a.m.

On 5/22/26 14:04, Stefano Brivio wrote:

...

On Fri, 22 May 2026 07:44:55 +0200 Stefano Brivio wrote:

...
On Fri, 22 May 2026 06:22:39 +0200 Stefano Brivio wrote:

...
On Fri, 22 May 2026 01:13:33 +0200 Laurent Vivier wrote:

...
On 5/21/26 10:30, Laurent Vivier wrote:

...
On 5/20/26 22:53, Stefano Brivio wrote:

...
On Wed, 20 May 2026 18:18:52 +0200 Stefano Brivio wrote:

> On Wed, 20 May 2026 18:07:08 +0200
> Stefano Brivio wrote:
>
>> On Wed, 20 May 2026 17:34:45 +0200
>> Stefano Brivio wrote:
>>> On Wed, 13 May 2026 13:52:08 +0200
>>> Laurent Vivier wrote:
>>>> Currently, the vhost-user path assumes each virtqueue element contains
>>>> exactly one iovec entry covering the entire frame. This assumption
>>>> breaks as some virtio-net drivers (notably iPXE) provide descriptors where the
>>>> vnet header and the frame payload are in separate buffers, resulting in
>>>> two iovec entries per virtqueue element.
>>>>
>>>> This series refactors the vhost-user data path so that frame lengths,
>>>> header sizes, and padding are tracked and passed explicitly rather than
>>>> being derived from iovec sizes. This decoupling is a prerequisite for
>>>> correctly handling padding of multi-buffer frames.
>>>
>>> Sorry to bring (likely) bad news, but this series seems to introduce a
>>> regression: I got the migration/rampstream_in tests fail twice in a
>>> row, which I've never seen happening (I think I saw a single failure a
>>> long time ago when the machine had a high CPU load, but nothing else).
>>>
>>> I'm currently bisecting and the bisect seems to point towards the end
>>> of the series (probably 10/10), but I haven't finished yet. I'll keep
>>> you posted. I haven't spotted anything that might cause issues there.
>>
>> Yeah, that's the one :(
>>
>> $ git bisect bad
>> db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
>> commit db798fc60f4c5869cb53168354e068fb4dabd91a
>> Author: Laurent Vivier
>> Date: Wed May 13 13:52:18 2026 +0200
>>
>> vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

I checked on my system with the commit previous to this series, bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not every time).

...
TCP/IPv4: sequence check, ramps, inbound ...failed.

and rampstream_out hangs sometimes too.

I'm going to try with earlier commits.

For me the problem can happen with any commit...

As it depends on the execution path and on the load and speed of the system it looks like a race condition.

Hah, thanks for checking. Maybe...

...
Did you try to test on a host with a kernel patched with "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?

Now I tried, and yes, the test doesn't hang anymore! I seem to have an issue with teardown functions on recent kernels (current net.git HEAD more or less):

--- [...]

2026/05/22 04:08:23 socat[73089] E connect(5, AF=40 cid:94558 port:22, 16): Connection timed out Connection closed by UNKNOWN port 65535 ... ---

it looks like we stop QEMU a bit too early. But it should be unrelated.

Oops, I forgot to upgrade QEMU on the virtual machine I was using to test those kernel builds, I had a somewhat outdated 8.1 version and it failed migration for unrelated reasons. It works with 11.0.

Back to kernel versions: the "problem" is that with a recent net-next.git HEAD, with or without my fix, in a nested VM, the test always passes (20/20). And I can't easily test things non-nested.

I guess I could just skip that test for the moment from the set I run on git push, and run it manually in the virtual machine.

But judging from captures (test_logs/pasta_1.pcap from PCAP=1 ./run) I'm fairly sure it's not *that* issue:

465 12.141763 192.0.2.1 → 88.198.0.164 58451 TCP [TCP Window Full] 34416 → 10001 [PSH, ACK] Seq=10002100 Ack=1 Win=65536 Len=58397
466 12.187195 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
467 13.187281 192.0.2.1 → 88.198.0.164 4150 TCP 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096

last data transfer from client (rampstream):

468 13.187358 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0

everything acknowledged, migration starts now:

469 14.143217 fe80::f471:c3ff:fe10:4e45 → ff02::2 70 ICMPv6 Router Solicitation from f6:71:c3:10:4e:45
470 14.687123 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] [TCP Keep-Alive] 10001 → 34416 [ACK] Seq=0 Ack=10060497 Win=0 Len=0

migration completed: and we acknowledge the right sequence (10060497), so it didn't jump forward.

But starting from this point:

471 14.687265 192.0.2.1 → 88.198.0.164 60 TCP 34416 → 10001 [ACK] Seq=10060497 Ack=1 Win=65536 Len=0
472 16.687412 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
473 16.687450 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
474 20.687650 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
475 20.687692 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
476 28.687817 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096

we keep advertising a zero window (that's the kernel doing it really), as if we were unable to dequeue data.

I enabled --trace just for the target instance of passt, and I don't see anything suspicious there:

13.0958: Receiving 1 flows
13.0958: Flow 0 (NEW): FREE -> NEW
13.0958: Flow 0 (TCP connection): TGT -> TYPED
13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001
13.0958: Flow 0 (TCP connection): Side 1 hash table insert: bucket: 138154
13.0958: Flow 0 (TCP connection): TYPED -> ACTIVE
13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001
13.0959: Flow 0 (TCP connection): Extended migration data, socket 83 sequences send 3121929544 receive 1643895001
13.0959: Flow 0 (TCP connection): pending queues: send 0 not sent 0 receive 3500081
13.0959: Flow 0 (TCP connection): window: snd_wl1 1647395082 snd_wnd 65536 max 65536 rcv_wnd 0 rcv_wup 1647395082
13.0959: Flow 0 (TCP connection): SO_PEEK_OFF disabled offset=0
13.0985: Got packet, but RX virtqueue not usable yet
13.0985: Closing migration channel, fd: 82
13.0985: Closing TCP_REPAIR helper socket
13.0985: passt: epoll event on vhost-user command socket 77 (events: 0x00000001)

then the usual VHOST_USER_CHECK_DEVICE_STATE and VHOST_USER_SET_VRING_ENABLE commands. After that, a tight loop of:

13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.0986: Got packet, but RX virtqueue not usable yet
13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.0986: Got packet, but RX virtqueue not usable yet

until we go further with the vhost-user setup. I still see this message which I had never noticed (but I didn't try to bisect around it):

13.1006: ================ Vhost user message ================
13.1006: Request: VHOST_USER_SET_VRING_ADDR (9)
[...]
13.1006: Last avail index != used index: 3252 != 3027

and then after VHOST_USER_SET_VRING_CALL, and:

13.1008: passt: epoll event on vhost-user kick socket 78 (events: 0x00000001)
13.1008: vhost-user: got kick_data: 0000000000000001 idx: 1

it's just a tight loop of:

13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)

as if we weren't dequeueing anything from there.
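For illustration only, here is a minimal sketch (not passt code) of why such a tight loop is what one would expect if the socket is registered level-triggered and nothing reads from it: epoll keeps reporting EPOLLIN until the pending data is dequeued. The helper name epollin_reports() and the use of a Unix socket pair are assumptions made for the demo, and whether passt watches socket 83 without EPOLLET at this point is also an assumption, not something the trace confirms.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo helper: queue one byte on a socket pair, then poll it
 * 'rounds' times with a zero timeout. Returns how many polls reported
 * EPOLLIN, or -1 on setup failure. If 'drain' is set, the data is read
 * after the first report. */
static int epollin_reports(int rounds, int drain)
{
	struct epoll_event ev = { .events = EPOLLIN }, got;
	int sv[2], epfd, i, hits = 0;
	char buf[16];

	assert(rounds > 0);

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return -1;

	epfd = epoll_create1(0);
	if (epfd < 0) {
		hits = -1;
		goto close_pair;
	}

	/* Level-triggered registration: no EPOLLET */
	ev.data.fd = sv[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) ||
	    write(sv[1], "x", 1) != 1) {
		hits = -1;
		goto close_all;
	}

	for (i = 0; i < rounds; i++) {
		if (epoll_wait(epfd, &got, 1, 0) != 1)
			continue;	/* nothing pending anymore */

		hits++;
		if (drain && read(sv[0], buf, sizeof(buf)) < 0)
			break;
	}

close_all:
	close(epfd);
close_pair:
	close(sv[0]);
	close(sv[1]);
	return hits;
}
```

With drain set to 0 every poll reports the socket as readable, which matches the repeated "epoll event on connected TCP socket 83" lines; with drain set to 1 only the first poll does.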

I start suspecting we might be hitting two different issues: perhaps things fail on your setup because of the kernel bug with TCP_REPAIR not freezing the queue, and they fail on my setup for some other reason.

For me it's very deterministic though: with patch 10/10 things always fail, and without it they never fail.

I guess I'll add more prints and check for more messages before/after that patch.
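As a side note on the "Last avail index != used index: 3252 != 3027" message quoted above: assuming both values are the usual free-running 16-bit split-virtqueue indices, the gap between them can be computed with wrapping subtraction. This is a worked arithmetic sketch, the helper name vq_in_flight() is hypothetical, and reading the result as "buffers taken from the ring but not yet returned as used" is my interpretation rather than something the log states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: distance between two free-running 16-bit ring
 * indices. The cast makes the subtraction wrap modulo 2^16, so it stays
 * correct after the counters overflow. */
static uint16_t vq_in_flight(uint16_t avail_idx, uint16_t used_idx)
{
	return (uint16_t)(avail_idx - used_idx);
}

/* Usage example with the values from the log line: 3252 - 3027 */
static void vq_in_flight_example(void)
{
	assert(vq_in_flight(3252, 3027) == 225);
}
```

So, under those assumptions, 225 buffers were outstanding on that queue when the target took over.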

In fact there is a buffer leak because iov_skip_bytes() doesn't correctly compute the number of used elements and then we don't release all the unused buffers.

I'm trying to fix that.

Please try with series "[PATCH v7 0/4] vhost-user,tcp: Handle multiple iovec entries per virtqueue element" applied, it reworks this part.

Thanks,
Laurent

Stefano Brivio

9:59 a.m.

On Tue, 26 May 2026 09:31:51 +0200 Laurent Vivier wrote:

...

On 5/22/26 14:04, Stefano Brivio wrote:

...
On Fri, 22 May 2026 07:44:55 +0200 Stefano Brivio wrote:

...
On Fri, 22 May 2026 06:22:39 +0200 Stefano Brivio wrote:

...
On Fri, 22 May 2026 01:13:33 +0200 Laurent Vivier wrote:

...
On 5/21/26 10:30, Laurent Vivier wrote:

...
On 5/20/26 22:53, Stefano Brivio wrote:
> On Wed, 20 May 2026 18:18:52 +0200
> Stefano Brivio wrote:
>
>> On Wed, 20 May 2026 18:07:08 +0200
>> Stefano Brivio wrote:
>>
>>> On Wed, 20 May 2026 17:34:45 +0200
>>> Stefano Brivio wrote:
>>>> On Wed, 13 May 2026 13:52:08 +0200
>>>> Laurent Vivier wrote:
>>>>> Currently, the vhost-user path assumes each virtqueue element contains
>>>>> exactly one iovec entry covering the entire frame. This assumption
>>>>> breaks as some virtio-net drivers (notably iPXE) provide descriptors where the
>>>>> vnet header and the frame payload are in separate buffers, resulting in
>>>>> two iovec entries per virtqueue element.
>>>>>
>>>>> This series refactors the vhost-user data path so that frame lengths,
>>>>> header sizes, and padding are tracked and passed explicitly rather than
>>>>> being derived from iovec sizes. This decoupling is a prerequisite for
>>>>> correctly handling padding of multi-buffer frames.
>>>>
>>>> Sorry to bring (likely) bad news, but this series seems to introduce a
>>>> regression: I got the migration/rampstream_in tests fail twice in a
>>>> row, which I've never seen happening (I think I saw a single failure a
>>>> long time ago when the machine had a high CPU load, but nothing else).
>>>>
>>>> I'm currently bisecting and the bisect seems to point towards the end
>>>> of the series (probably 10/10), but I haven't finished yet. I'll keep
>>>> you posted. I haven't spotted anything that might cause issues there.
>>>
>>> Yeah, that's the one :(
>>>
>>> $ git bisect bad
>>> db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
>>> commit db798fc60f4c5869cb53168354e068fb4dabd91a
>>> Author: Laurent Vivier
>>> Date: Wed May 13 13:52:18 2026 +0200
>>>
>>> vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

I checked on my system with the commit previous to this series, bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not every time).

> TCP/IPv4: sequence check, ramps, inbound ...failed.

and rampstream_out hangs sometimes too.

I'm going to try with earlier commits.

For me the problem can happen with any commit...

As it depends on the execution path and on the load and speed of the system it looks like a race condition.

Hah, thanks for checking. Maybe...

...
Did you try to test on a host with a kernel patched with "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?

Now I tried, and yes, the test doesn't hang anymore! I seem to have an issue with teardown functions on recent kernels (current net.git HEAD more or less):

--- [...]

2026/05/22 04:08:23 socat[73089] E connect(5, AF=40 cid:94558 port:22, 16): Connection timed out Connection closed by UNKNOWN port 65535 ... ---

it looks like we stop QEMU a bit too early. But it should be unrelated.

Oops, I forgot to upgrade QEMU on the virtual machine I was using to test those kernel builds, I had a somewhat outdated 8.1 version and it failed migration for unrelated reasons. It works with 11.0.

Back to kernel versions: the "problem" is that with a recent net-next.git HEAD, with or without my fix, in a nested VM, the test always passes (20/20). And I can't easily test things non-nested.

I guess I could just skip that test for the moment from the set I run on git push, and run it manually in the virtual machine.

But judging from captures (test_logs/pasta_1.pcap from PCAP=1 ./run) I'm fairly sure it's not *that* issue:

465 12.141763 192.0.2.1 → 88.198.0.164 58451 TCP [TCP Window Full] 34416 → 10001 [PSH, ACK] Seq=10002100 Ack=1 Win=65536 Len=58397
466 12.187195 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
467 13.187281 192.0.2.1 → 88.198.0.164 4150 TCP 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096

last data transfer from client (rampstream):

468 13.187358 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0

everything acknowledged, migration starts now:

469 14.143217 fe80::f471:c3ff:fe10:4e45 → ff02::2 70 ICMPv6 Router Solicitation from f6:71:c3:10:4e:45
470 14.687123 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] [TCP Keep-Alive] 10001 → 34416 [ACK] Seq=0 Ack=10060497 Win=0 Len=0

migration completed: and we acknowledge the right sequence (10060497), so it didn't jump forward.

But starting from this point:

471 14.687265 192.0.2.1 → 88.198.0.164 60 TCP 34416 → 10001 [ACK] Seq=10060497 Ack=1 Win=65536 Len=0
472 16.687412 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
473 16.687450 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
474 20.687650 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
475 20.687692 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
476 28.687817 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096

we keep advertising a zero window (that's the kernel doing it really), as if we were unable to dequeue data.

I enabled --trace just for the target instance of passt, and I don't see anything suspicious there:

13.0958: Receiving 1 flows
13.0958: Flow 0 (NEW): FREE -> NEW
13.0958: Flow 0 (TCP connection): TGT -> TYPED
13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001
13.0958: Flow 0 (TCP connection): Side 1 hash table insert: bucket: 138154
13.0958: Flow 0 (TCP connection): TYPED -> ACTIVE
13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001
13.0959: Flow 0 (TCP connection): Extended migration data, socket 83 sequences send 3121929544 receive 1643895001
13.0959: Flow 0 (TCP connection): pending queues: send 0 not sent 0 receive 3500081
13.0959: Flow 0 (TCP connection): window: snd_wl1 1647395082 snd_wnd 65536 max 65536 rcv_wnd 0 rcv_wup 1647395082
13.0959: Flow 0 (TCP connection): SO_PEEK_OFF disabled offset=0
13.0985: Got packet, but RX virtqueue not usable yet
13.0985: Closing migration channel, fd: 82
13.0985: Closing TCP_REPAIR helper socket
13.0985: passt: epoll event on vhost-user command socket 77 (events: 0x00000001)

then the usual VHOST_USER_CHECK_DEVICE_STATE and VHOST_USER_SET_VRING_ENABLE commands. After that, a tight loop of:

13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.0986: Got packet, but RX virtqueue not usable yet
13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.0986: Got packet, but RX virtqueue not usable yet

until we go further with the vhost-user setup. I still see this message which I had never noticed (but I didn't try to bisect around it):

13.1006: ================ Vhost user message ================
13.1006: Request: VHOST_USER_SET_VRING_ADDR (9)
[...]
13.1006: Last avail index != used index: 3252 != 3027

and then after VHOST_USER_SET_VRING_CALL, and:

13.1008: passt: epoll event on vhost-user kick socket 78 (events: 0x00000001)
13.1008: vhost-user: got kick_data: 0000000000000001 idx: 1

it's just a tight loop of:

13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)

as if we weren't dequeueing anything from there.

I start suspecting we might be hitting two different issues: perhaps things fail on your setup because of the kernel bug with TCP_REPAIR not freezing the queue, and they fail on my setup for some other reason.

For me it's very deterministic though: with patch 10/10 things always fail, and without it they never fail.

I guess I'll add more prints and check for more messages before/after that patch.

In fact there is a buffer leak because iov_skip_bytes() doesn't correctly compute the number of used elements and then we don't release all the unused buffers.
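To make the failure mode concrete, here is a minimal sketch of the kind of computation involved. It is not passt's iov_skip_bytes(); the helper name iov_entries_used() and its contract are assumptions for illustration. The idea is that, once a frame of a given length has been written into a collected iovec array, every entry the frame touches (even partially) counts as used, and only the remaining entries can be handed back. If the used count comes out too high, some unused buffers are never released.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper: return how many entries of iov[0..n) are touched by
 * the first 'len' bytes. A partially filled entry counts as used. If 'len'
 * exceeds the total capacity, all n entries are reported. */
static size_t iov_entries_used(const struct iovec *iov, size_t n, size_t len)
{
	size_t i;

	assert(iov || !n);

	for (i = 0; i < n && len; i++) {
		if (len <= iov[i].iov_len)
			return i + 1;	/* frame ends inside entry i */
		len -= iov[i].iov_len;
	}

	return i;
}
```

In this sketch, the number of buffers to give back would be n minus the returned value, so with entries of 10, 20 and 30 bytes an 11-byte frame uses two entries and leaves one to release.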

I'm trying to fix that.

Please try with series "[PATCH v7 0/4] vhost-user,tcp: Handle multiple iovec entries per virtqueue element" applied, it reworks this part.

I'm trying it now. If that totally reworks this part and it fixes things and it's ready to be merged (sorry, I didn't manage to have a look yet) I don't think it's strictly necessary to figure out the leak.

-- Stefano

Stefano Brivio

10:38 a.m.

On Tue, 26 May 2026 09:59:55 +0200 Stefano Brivio wrote:

...

On Tue, 26 May 2026 09:31:51 +0200 Laurent Vivier wrote:

...
On 5/22/26 14:04, Stefano Brivio wrote:

...
On Fri, 22 May 2026 07:44:55 +0200 Stefano Brivio wrote:

...
On Fri, 22 May 2026 06:22:39 +0200 Stefano Brivio wrote:

...
On Fri, 22 May 2026 01:13:33 +0200 Laurent Vivier wrote:

...
On 5/21/26 10:30, Laurent Vivier wrote:
> On 5/20/26 22:53, Stefano Brivio wrote:
>> On Wed, 20 May 2026 18:18:52 +0200
>> Stefano Brivio wrote:
>>
>>> On Wed, 20 May 2026 18:07:08 +0200
>>> Stefano Brivio wrote:
>>>
>>>> On Wed, 20 May 2026 17:34:45 +0200
>>>> Stefano Brivio wrote:
>>>>> On Wed, 13 May 2026 13:52:08 +0200
>>>>> Laurent Vivier wrote:
>>>>>> Currently, the vhost-user path assumes each virtqueue element contains
>>>>>> exactly one iovec entry covering the entire frame. This assumption
>>>>>> breaks as some virtio-net drivers (notably iPXE) provide descriptors where the
>>>>>> vnet header and the frame payload are in separate buffers, resulting in
>>>>>> two iovec entries per virtqueue element.
>>>>>>
>>>>>> This series refactors the vhost-user data path so that frame lengths,
>>>>>> header sizes, and padding are tracked and passed explicitly rather than
>>>>>> being derived from iovec sizes. This decoupling is a prerequisite for
>>>>>> correctly handling padding of multi-buffer frames.
>>>>>
>>>>> Sorry to bring (likely) bad news, but this series seems to introduce a
>>>>> regression: I got the migration/rampstream_in tests fail twice in a
>>>>> row, which I've never seen happening (I think I saw a single failure a
>>>>> long time ago when the machine had a high CPU load, but nothing else).
>>>>>
>>>>> I'm currently bisecting and the bisect seems to point towards the end
>>>>> of the series (probably 10/10), but I haven't finished yet. I'll keep
>>>>> you posted. I haven't spotted anything that might cause issues there.
>>>>
>>>> Yeah, that's the one :(
>>>>
>>>> $ git bisect bad
>>>> db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
>>>> commit db798fc60f4c5869cb53168354e068fb4dabd91a
>>>> Author: Laurent Vivier
>>>> Date: Wed May 13 13:52:18 2026 +0200
>>>>
>>>> vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()
>
> I checked on my system with the commit previous to this series,
> bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not
> every time).
>
> > TCP/IPv4: sequence check, ramps, inbound
> ...failed.
>
> and rampstream_out hangs sometimes too.
>
> I'm going to try with earlier commits.

For me the problem can happen with any commit...

As it depends on the execution path and on the load and speed of the system it looks like a race condition.

Hah, thanks for checking. Maybe...

...
Did you try to test on a host with a kernel patched with "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?

Now I tried, and yes, the test doesn't hang anymore! I seem to have an issue with teardown functions on recent kernels (current net.git HEAD more or less):

--- [...]

2026/05/22 04:08:23 socat[73089] E connect(5, AF=40 cid:94558 port:22, 16): Connection timed out Connection closed by UNKNOWN port 65535 ... ---

it looks like we stop QEMU a bit too early. But it should be unrelated.

Oops, I forgot to upgrade QEMU on the virtual machine I was using to test those kernel builds, I had a somewhat outdated 8.1 version and it failed migration for unrelated reasons. It works with 11.0.

Back to kernel versions: the "problem" is that with a recent net-next.git HEAD, with or without my fix, in a nested VM, the test always passes (20/20). And I can't easily test things non-nested.

I guess could just skip that test for the moment from the set I run git push, and run it manually in the virtual machine, for the moment.

But judging from captures (test_logs/pasta_1.pcap from PCAP=1 ./run) I'm fairly sure it's not *that* issue:

465 12.141763 192.0.2.1 → 88.198.0.164 58451 TCP [TCP Window Full] 34416 → 10001 [PSH, ACK] Seq=10002100 Ack=1 Win=65536 Len=58397 466 12.187195 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0 467 13.187281 192.0.2.1 → 88.198.0.164 4150 TCP 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096

last data transfer from client (rampstream):

468 13.187358 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0

everything acknowledged, migration starts now:

469 14.143217 fe80::f471:c3ff:fe10:4e45 → ff02::2 70 ICMPv6 Router Solicitation from f6:71:c3:10:4e:45 470 14.687123 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] [TCP Keep-Alive] 10001 → 34416 [ACK] Seq=0 Ack=10060497 Win=0 Len=0

migration completed: and we acknowledge the right sequence (10060497), so it didn't jump forward.

But starting from this point:

471 14.687265 192.0.2.1 → 88.198.0.164 60 TCP 34416 → 10001 [ACK] Seq=10060497 Ack=1 Win=65536 Len=0 472 16.687412 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096 473 16.687450 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0 474 20.687650 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096 475 20.687692 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0 476 28.687817 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096

we keep advertising a zero window (that's the kernel doing it really), as if we were unable to dequeue data.

I enabled --trace just for the target instance of passt, and I don't see anything suspicious there:

13.0958: Receiving 1 flows 13.0958: Flow 0 (NEW): FREE -> NEW 13.0958: Flow 0 (TCP connection): TGT -> TYPED 13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001 13.0958: Flow 0 (TCP connection): Side 1 hash table insert: bucket: 138154 13.0958: Flow 0 (TCP connection): TYPED -> ACTIVE 13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001 13.0959: Flow 0 (TCP connection): Extended migration data, socket 83 sequences send 3121929544 receive 1643895001 13.0959: Flow 0 (TCP connection): pending queues: send 0 not sent 0 receive 3500081 13.0959: Flow 0 (TCP connection): window: snd_wl1 1647395082 snd_wnd 65536 max 65536 rcv_wnd 0 rcv_wup 1647395082 13.0959: Flow 0 (TCP connection): SO_PEEK_OFF disabled offset=0 13.0985: Got packet, but RX virtqueue not usable yet 13.0985: Closing migration channel, fd: 82 13.0985: Closing TCP_REPAIR helper socket 13.0985: passt: epoll event on vhost-user command socket 77 (events: 0x00000001)

then the usual VHOST_USER_CHECK_DEVICE_STATE and VHOST_USER_SET_VRING_ENABLE commands. After that, a tight loop of:

13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.0986: Got packet, but RX virtqueue not usable yet
13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.0986: Got packet, but RX virtqueue not usable yet

until we go further with the vhost-user setup. I still see this message which I had never noticed (but I didn't try to bisect around it):

13.1006: ================ Vhost user message ================
13.1006: Request: VHOST_USER_SET_VRING_ADDR (9)
[...]
13.1006: Last avail index != used index: 3252 != 3027

and then after VHOST_USER_SET_VRING_CALL, and:

13.1008: passt: epoll event on vhost-user kick socket 78 (events: 0x00000001)
13.1008: vhost-user: got kick_data: 0000000000000001 idx: 1

it's just a tight loop of:

13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)

as if we weren't dequeueing anything from there.
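To illustrate why a trace can fill up with identical epoll events, here is a minimal sketch, not taken from passt. It assumes the descriptor is registered level-triggered (EPOLLIN without EPOLLET); events: 0x00000001 in the trace is the value of EPOLLIN on Linux. The helper name count_epollin_reports() is made up for this example.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (not from passt): count how many of 'rounds'
 * consecutive epoll_wait() calls report EPOLLIN on 'fd' when nothing
 * reads from it in between. Returns -1 on setup failure. */
int count_epollin_reports(int fd, int rounds)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
	int epfd, i, hits = 0;

	assert(rounds >= 0);

	epfd = epoll_create1(0);
	if (epfd < 0)
		return -1;

	/* Level-triggered registration: no EPOLLET */
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(epfd);
		return -1;
	}

	for (i = 0; i < rounds; i++) {
		struct epoll_event out;

		/* Zero timeout: only report what is already pending */
		if (epoll_wait(epfd, &out, 1, 0) == 1 && (out.events & EPOLLIN))
			hits++;

		/* Deliberately no read() here: the data stays queued, so
		 * the next epoll_wait() reports the same event again */
	}

	close(epfd);
	return hits;
}
```

With one unread byte sitting in a pipe, calling this on the read end with rounds set to 3 returns 3. Once the byte has been read, it returns 0. The same mechanism would produce a tight loop of events on a socket whose data is never dequeued.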

I start suspecting we might be hitting two different issues: perhaps things fail on your setup because of the kernel bug with TCP_REPAIR not freezing the queue, and they fail on my setup for some other reason.

For me it's very deterministic though: with patch 10/10 things always fail, and without it they never fail.

I guess I'll add more prints and check for more messages before/after that patch.

In fact there is a buffer leak because iov_skip_bytes() doesn't correctly compute the number of used elements and then we don't release all the unused buffers.

I'm trying to fix that.
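As a rough illustration of the kind of accounting involved (a sketch only: this is neither passt's iov_skip_bytes() nor the actual fix, and the thread doesn't say where exactly the miscount came from), the helpers below count how many leading iovec entries a frame of a given length touches, and how many are left over and would need to be handed back. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper: number of leading entries of iov[] touched by a
 * frame of 'len' bytes. A frame ending in the middle of entry i uses
 * i + 1 entries, one ending exactly at the end of entry i also uses
 * i + 1, and an empty frame uses none. */
size_t iov_entries_used(const struct iovec *iov, size_t cnt, size_t len)
{
	size_t i;

	for (i = 0; i < cnt && len > 0; i++) {
		size_t chunk = len < iov[i].iov_len ? len : iov[i].iov_len;

		len -= chunk;	/* Bytes of the frame stored in entry i */
	}

	return i;
}

/* Hypothetical helper: entries collected but not needed by the frame.
 * These are the ones that must be released, otherwise they leak. */
size_t iov_entries_unused(const struct iovec *iov, size_t cnt, size_t len)
{
	size_t used = iov_entries_used(iov, cnt, len);

	assert(used <= cnt);
	return cnt - used;
}
```

For three buffers of 10, 20 and 30 bytes, an 11-byte frame uses two entries and leaves one to release. If the used count comes out too high, the leftover buffers are never given back, which over time looks like a queue that stops making progress.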

Please try with series "[PATCH v7 0/4] vhost-user,tcp: Handle multiple iovec entries per virtqueue element" applied, it reworks this part.

I'm trying it now. If that totally reworks this part and it fixes things and it's ready to be merged (sorry, I didn't manage to have a look yet) I don't think it's strictly necessary to figure out the leak.

All tests pass with it, rampstream_in passed 20/20 times. Should I go ahead and merge both series (UDP and TCP, they both look ready) or do you still need to figure out the buffer leak first for other reasons?

--
Stefano

Laurent Vivier

10:54 a.m.

On 5/26/26 10:38, Stefano Brivio wrote:

On Tue, 26 May 2026 09:59:55 +0200 Stefano Brivio wrote:
On Tue, 26 May 2026 09:31:51 +0200 Laurent Vivier wrote:
On 5/22/26 14:04, Stefano Brivio wrote:
On Fri, 22 May 2026 07:44:55 +0200 Stefano Brivio wrote:
On Fri, 22 May 2026 06:22:39 +0200 Stefano Brivio wrote:
On Fri, 22 May 2026 01:13:33 +0200 Laurent Vivier wrote:

> On 5/21/26 10:30, Laurent Vivier wrote:
>> On 5/20/26 22:53, Stefano Brivio wrote:
>>> On Wed, 20 May 2026 18:18:52 +0200
>>> Stefano Brivio wrote:
>>>
>>>> On Wed, 20 May 2026 18:07:08 +0200
>>>> Stefano Brivio wrote:
>>>>
>>>>> On Wed, 20 May 2026 17:34:45 +0200
>>>>> Stefano Brivio wrote:
>>>>>> On Wed, 13 May 2026 13:52:08 +0200
>>>>>> Laurent Vivier wrote:
>>>>>>> Currently, the vhost-user path assumes each virtqueue element contains
>>>>>>> exactly one iovec entry covering the entire frame. This assumption
>>>>>>> breaks as some virtio-net drivers (notably iPXE) provide descriptors where the
>>>>>>> vnet header and the frame payload are in separate buffers, resulting in
>>>>>>> two iovec entries per virtqueue element.
>>>>>>>
>>>>>>> This series refactors the vhost-user data path so that frame lengths,
>>>>>>> header sizes, and padding are tracked and passed explicitly rather than
>>>>>>> being derived from iovec sizes. This decoupling is a prerequisite for
>>>>>>> correctly handling padding of multi-buffer frames.
>>>>>>
>>>>>> Sorry to bring (likely) bad news, but this series seems to introduce a
>>>>>> regression: I got the migration/rampstream_in tests fail twice in a
>>>>>> row, which I've never saw happening (I think I saw a single failure a
>>>>>> long time ago when the machine had a high CPU load, but nothing else).
>>>>>>
>>>>>> I'm currently bisecting and the bisect seems to point towards the end
>>>>>> of the series (probably 10/10), but I haven't finished yet. I'll keep
>>>>>> you posted. I haven't spotted anything that might cause issues there.
>>>>>
>>>>> Yeah, that's the one :(
>>>>>
>>>>> $ git bisect bad
>>>>> db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
>>>>> commit db798fc60f4c5869cb53168354e068fb4dabd91a
>>>>> Author: Laurent Vivier
>>>>> Date: Wed May 13 13:52:18 2026 +0200
>>>>>
>>>>> vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()
>>
>> I checked on my system with the commit previous to this series,
>> bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not
>> everytime).
>>
>> > TCP/IPv4: sequence check, ramps, inbound
>> ...failed.
>>
>> and rampstream_out hangs sometime too.
>>
>> I'm going to try with ealier commits.
>
> For me the problem can happen with any commit...
>
> As it depends on the execution path and on the load and speed of the system it looks like
> a race condition.

Hah, thanks for checking. Maybe...

> Did you try to test on a host with a kernel patched with > "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?

Now I tried, and yes, the test doesn't hang anymore! I seem to have an issue with teardown functions on recent kernels (current net.git HEAD more or less):

---
[...]

2026/05/22 04:08:23 socat[73089] E connect(5, AF=40 cid:94558 port:22, 16): Connection timed out
Connection closed by UNKNOWN port 65535
...
---

it looks like we stop QEMU a bit too early. But it should be unrelated.

Oops, I forgot to upgrade QEMU on the virtual machine I was using to test those kernel builds, I had a somewhat outdated 8.1 version and it failed migration for unrelated reasons. It works with 11.0.

Back to kernel versions: the "problem" is that with a recent net-next.git HEAD, with or without my fix, in a nested VM, the test always passes (20/20). And I can't easily test things non-nested.

I guess I could just skip that test, for the moment, from the set I run on git push, and run it manually in the virtual machine.

But judging from captures (test_logs/pasta_1.pcap from PCAP=1 ./run) I'm fairly sure it's not *that* issue:

465 12.141763 192.0.2.1 → 88.198.0.164 58451 TCP [TCP Window Full] 34416 → 10001 [PSH, ACK] Seq=10002100 Ack=1 Win=65536 Len=58397
466 12.187195 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
467 13.187281 192.0.2.1 → 88.198.0.164 4150 TCP 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096

last data transfer from client (rampstream):

468 13.187358 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0

everything acknowledged, migration starts now:

469 14.143217 fe80::f471:c3ff:fe10:4e45 → ff02::2 70 ICMPv6 Router Solicitation from f6:71:c3:10:4e:45
470 14.687123 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] [TCP Keep-Alive] 10001 → 34416 [ACK] Seq=0 Ack=10060497 Win=0 Len=0

migration completed: and we acknowledge the right sequence (10060497), so it didn't jump forward.

But starting from this point:

471 14.687265 192.0.2.1 → 88.198.0.164 60 TCP 34416 → 10001 [ACK] Seq=10060497 Ack=1 Win=65536 Len=0
472 16.687412 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
473 16.687450 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
474 20.687650 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
475 20.687692 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
476 28.687817 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096

we keep advertising a zero window (that's the kernel doing it really), as if we were unable to dequeue data.

I enabled --trace just for the target instance of passt, and I don't see anything suspicious there:

13.0958: Receiving 1 flows
13.0958: Flow 0 (NEW): FREE -> NEW
13.0958: Flow 0 (TCP connection): TGT -> TYPED
13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001
13.0958: Flow 0 (TCP connection): Side 1 hash table insert: bucket: 138154
13.0958: Flow 0 (TCP connection): TYPED -> ACTIVE
13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001
13.0959: Flow 0 (TCP connection): Extended migration data, socket 83 sequences send 3121929544 receive 1643895001
13.0959: Flow 0 (TCP connection): pending queues: send 0 not sent 0 receive 3500081
13.0959: Flow 0 (TCP connection): window: snd_wl1 1647395082 snd_wnd 65536 max 65536 rcv_wnd 0 rcv_wup 1647395082
13.0959: Flow 0 (TCP connection): SO_PEEK_OFF disabled offset=0
13.0985: Got packet, but RX virtqueue not usable yet
13.0985: Closing migration channel, fd: 82
13.0985: Closing TCP_REPAIR helper socket
13.0985: passt: epoll event on vhost-user command socket 77 (events: 0x00000001)

then the usual VHOST_USER_CHECK_DEVICE_STATE and VHOST_USER_SET_VRING_ENABLE commands. After that, a tight loop of:

13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.0986: Got packet, but RX virtqueue not usable yet
13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.0986: Got packet, but RX virtqueue not usable yet

until we go further with the vhost-user setup. I still see this message which I had never noticed (but I didn't try to bisect around it):

13.1006: ================ Vhost user message ================
13.1006: Request: VHOST_USER_SET_VRING_ADDR (9)
[...]
13.1006: Last avail index != used index: 3252 != 3027

and then after VHOST_USER_SET_VRING_CALL, and:

13.1008: passt: epoll event on vhost-user kick socket 78 (events: 0x00000001)
13.1008: vhost-user: got kick_data: 0000000000000001 idx: 1

it's just a tight loop of:

13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)

as if we weren't dequeueing anything from there.

I start suspecting we might be hitting two different issues: perhaps things fail on your setup because of the kernel bug with TCP_REPAIR not freezing the queue, and they fail on my setup for some other reason.

For me it's very deterministic though: with patch 10/10 things always fail, and without it they never fail.

I guess I'll add more prints and check for more messages before/after that patch.

In fact there is a buffer leak because iov_skip_bytes() doesn't correctly compute the number of used elements and then we don't release all the unused buffers.

I'm trying to fix that.

Please try with series "[PATCH v7 0/4] vhost-user,tcp: Handle multiple iovec entries per virtqueue element" applied, it reworks this part.

I'm trying it now. If that totally reworks this part and it fixes things and it's ready to be merged (sorry, I didn't manage to have a look yet) I don't think it's strictly necessary to figure out the leak.

All tests pass with it, rampstream_in passed 20/20 times. Should I go ahead and merge both series (UDP and TCP, they both look ready) or do you still need to figure out the buffer leak first for other reasons?

No, you can go ahead.

Thanks,
Laurent

29 comments

4 participants

participants (4)

David Gibson
David GIbson
Laurent Vivier
Stefano Brivio