Re: [PATCH v4 00/10] vhost-user: Preparatory series for multiple iovec entries per virtqueue element

22 May 2026

On Fri, 22 May 2026 06:22:39 +0200
Stefano Brivio  wrote:
...
On Fri, 22 May 2026 01:13:33 +0200
Laurent Vivier  wrote:
...
On 5/21/26 10:30, Laurent Vivier wrote:
...
On 5/20/26 22:53, Stefano Brivio wrote:
...
On Wed, 20 May 2026 18:18:52 +0200
Stefano Brivio  wrote:
...
On Wed, 20 May 2026 18:07:08 +0200
Stefano Brivio  wrote:
...
On Wed, 20 May 2026 17:34:45 +0200
Stefano Brivio  wrote:    
> On Wed, 13 May 2026 13:52:08 +0200
> Laurent Vivier  wrote:    
>> Currently, the vhost-user path assumes each virtqueue element contains
>> exactly one iovec entry covering the entire frame.  This assumption
>> breaks as some virtio-net drivers (notably iPXE) provide descriptors where the
>> vnet header and the frame payload are in separate buffers, resulting in
>> two iovec entries per virtqueue element.
>>
>> This series refactors the vhost-user data path so that frame lengths,
>> header sizes, and padding are tracked and passed explicitly rather than
>> being derived from iovec sizes.  This decoupling is a prerequisite for
>> correctly handling padding of multi-buffer frames.    
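
To illustrate the layout described above (this is not code from the series): a minimal C sketch, assuming a 12-byte vnet header and hypothetical helper names iov_total(), frame_len() and pad_len(). It shows why taking the frame length from the first iovec entry alone breaks once header and payload sit in separate buffers, and how a length computed over all entries gives the padding needed to reach the 60-byte minimum Ethernet frame size.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Minimum Ethernet frame length without FCS (same value as ETH_ZLEN) */
#define MIN_FRAME	60

/* Hypothetical helper: total bytes described by an element's iovec array */
static size_t iov_total(const struct iovec *iov, size_t cnt)
{
	size_t i, len = 0;

	for (i = 0; i < cnt; i++)
		len += iov[i].iov_len;

	return len;
}

/* Hypothetical helper: frame length excluding a vnet header of hdrlen
 * bytes, regardless of which iovec entry the header ends in.
 * Returns 0 if the element is shorter than the header.
 */
static size_t frame_len(const struct iovec *iov, size_t cnt, size_t hdrlen)
{
	size_t total = iov_total(iov, cnt);

	return total > hdrlen ? total - hdrlen : 0;
}

/* Hypothetical helper: padding bytes needed to reach the minimum frame size */
static size_t pad_len(size_t len)
{
	return len < MIN_FRAME ? MIN_FRAME - len : 0;
}
```

With an iPXE-style element of two entries, 12 bytes of header and 42 bytes of payload, iov[0].iov_len - 12 is 0, whereas frame_len(iov, 2, 12) is 42 and pad_len(42) is 18.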
>
> Sorry to bring (likely) bad news, but this series seems to introduce a
> regression: I got the migration/rampstream_in tests failing twice in a
> row, which I've never seen happening (I think I saw a single failure a
> long time ago when the machine had a high CPU load, but nothing else).
>
> I'm currently bisecting and the bisect seems to point towards the end
> of the series (probably 10/10), but I haven't finished yet. I'll keep
> you posted. I haven't spotted anything that might cause issues there.
Yeah, that's the one :(
$ git bisect bad
db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
commit db798fc60f4c5869cb53168354e068fb4dabd91a
Author: Laurent Vivier 
Date:   Wed May 13 13:52:18 2026 +0200
     vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()
I checked on my system with the commit preceding this series,
bcc3d37a6e01 ("util: Fix changes to assert_with_msg()"), and
rampstream_in fails there too (not every time).
...
TCP/IPv4: sequence check, ramps, inbound    
...failed.
and rampstream_out sometimes hangs too.
I'm going to try with earlier commits.
For me the problem can happen with any commit...
Since it depends on the execution path and on the load and speed of the
system, it looks like a race condition.
Hah, thanks for checking. Maybe...
...
Did you try to test on a host with a kernel patched with
"[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive"?
Now I tried, and yes, the test doesn't hang anymore! I seem to have an
issue with teardown functions on recent kernels (current net.git HEAD
more or less):
---
+ teardown_migrate
+ cat /tmp/passt-tests-VVtLn0/migrate/qemu_1.pid
+ /home/sbrivio/passt/test/nstool exec /tmp/passt-tests-VVtLn0/migrate/ns1.hold -- kill 16
qemu-system-x86_64: terminating on signal 15 from pid 34 ()
+ cat /tmp/passt-tests-VVtLn0/migrate/qemu_2.pid
+ /home/sbrivio/passt/test/nstool exec /tmp/passt-tests-VVtLn0/migrate/ns1.hold -- kill 15
18.8974: ================ Vhost user message ================
18.8974: Request: VHOST_USER_GET_VRING_BASE (11)
18.8974: Flags:   0x1
18.8974: Size:    8
18.8974: State.index: 0
18.8975: ================ Vhost user message ================
18.8975: Request: VHOST_USER_GET_VRING_BASE (11)
18.8975: Flags:   0x1
18.8975: Size:    8
18.8975: State.index: 1
qemu-system-x86_64: terminating on signal 15 from pid 35 ()
18.7961: Client connection closed
18.7962: Closing TCP_REPAIR helper socket
+ context_wait qemu_1
+ __name=qemu_1
+ __pidfile=/tmp/passt-tests-VVtLn0/migrate/context_qemu_1.pid
+ cat /tmp/passt-tests-VVtLn0/migrate/context_qemu_1.pid
+ rc=0
+ rm /tmp/passt-tests-VVtLn0/migrate/context_passt_repair_2.stdout.9pwpVbQr /tmp/passt-tests-VVtLn0/migrate/context_passt_repair_2.stderr.dSY5hBu1
+ __pid=67766
+ rm /tmp/passt-tests-VVtLn0/migrate/context_qemu_1.pid
+ [ 1 -eq 1 ]
+ echo [Exit code: 0]
+ echo -n passt_repair_2$ 
+ return 0
18.9016: Client connection closed
18.9018: Closing TCP_REPAIR helper socket
+ wait 67766
+ rc=0
+ rm /tmp/passt-tests-VVtLn0/migrate/context_passt_repair_1.stdout.JEyDGxXe /tmp/passt-tests-VVtLn0/migrate/context_passt_repair_1.stderr.WU550iEI
+ [ 1 -eq 1 ]
+ echo [Exit code: 0]
+ echo -n passt_repair_1$ 
+ return 0
+ rc=0
+ rm /tmp/passt-tests-VVtLn0/migrate/context_qemu_2.stdout.Dm8EAhfl /tmp/passt-tests-VVtLn0/migrate/context_qemu_2.stderr.207qJYPA
+ [ 1 -eq 1 ]
+ echo [Exit code: 0]
+ echo -n qemu_2$ 
+ return 0
2026/05/22 04:08:23 socat[73089] E connect(5, AF=40 cid:94558 port:22, 16): Connection timed out
Connection closed by UNKNOWN port 65535
...
---
it looks like we stop QEMU a bit too early. But it should be unrelated.
I'm now trying to find some kind of workaround for existing (not fixed)
kernel versions. Maybe stopping rampstream_in for a moment or something
like that.
For some weird reason even very blatant throttling (100 ms - 1 s delays
every 10000 ramps, or an explicit 500 ms pause via signal before
migration) doesn't help.

So it doesn't seem to be *that* kind of race. I should probably check
the same exact kernel version with fix and without...
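
For reference, the kind of throttling described above could look roughly like this. It is a sketch only, not the actual test change: RAMPS_PER_PAUSE, on_pause_signal() and throttle_delay_ms() are hypothetical names, and the caller is assumed to sleep for the returned number of milliseconds (for instance with nanosleep()) before sending the next ramp.

```c
#include <signal.h>

#define RAMPS_PER_PAUSE	10000	/* periodic delay every 10000 ramps */
#define SIGNAL_PAUSE_MS	500	/* explicit pause requested via signal */

/* Set by the signal handler, consumed by the sender loop */
static volatile sig_atomic_t pause_requested;

/* Signal handler: only records the request, no sleeping here */
static void on_pause_signal(int sig)
{
	(void)sig;
	pause_requested = 1;
}

/* Return the delay, in milliseconds, to apply after sending ramp number n
 * (counting from 1), or 0 for no delay.  A pending signal request takes
 * precedence over the periodic delay and is cleared once reported.
 */
static long throttle_delay_ms(unsigned long n, long periodic_ms)
{
	if (pause_requested) {
		pause_requested = 0;
		return SIGNAL_PAUSE_MS;
	}

	return (n % RAMPS_PER_PAUSE == 0) ? periodic_ms : 0;
}
```

The handler would be installed with sigaction() for whichever signal the test script sends before starting the migration.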

-- 
Stefano