On Fri, 22 May 2026 06:22:39 +0200
Stefano Brivio
On Fri, 22 May 2026 01:13:33 +0200
Laurent Vivier wrote:
On 5/21/26 10:30, Laurent Vivier wrote:
On 5/20/26 22:53, Stefano Brivio wrote:
On Wed, 20 May 2026 18:18:52 +0200 Stefano Brivio wrote:
On Wed, 20 May 2026 18:07:08 +0200 Stefano Brivio wrote:
On Wed, 20 May 2026 17:34:45 +0200 Stefano Brivio
wrote:
> On Wed, 13 May 2026 13:52:08 +0200
> Laurent Vivier wrote:
>> Currently, the vhost-user path assumes each virtqueue element contains
>> exactly one iovec entry covering the entire frame. This assumption
>> breaks as some virtio-net drivers (notably iPXE) provide descriptors where the
>> vnet header and the frame payload are in separate buffers, resulting in
>> two iovec entries per virtqueue element.
>>
>> This series refactors the vhost-user data path so that frame lengths,
>> header sizes, and padding are tracked and passed explicitly rather than
>> being derived from iovec sizes. This decoupling is a prerequisite for
>> correctly handling padding of multi-buffer frames.
>
> Sorry to bring (likely) bad news, but this series seems to introduce a
> regression: I got the migration/rampstream_in tests failing twice in a
> row, which I've never seen happening (I think I saw a single failure a
> long time ago when the machine had a high CPU load, but nothing else).
>
> I'm currently bisecting and the bisect seems to point towards the end
> of the series (probably 10/10), but I haven't finished yet. I'll keep
> you posted.

I haven't spotted anything that might cause issues there.

Yeah, that's the one :(
$ git bisect bad
db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
commit db798fc60f4c5869cb53168354e068fb4dabd91a
Author: Laurent Vivier
Date:   Wed May 13 13:52:18 2026 +0200

    vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()
I checked on my system with the commit preceding this series, bcc3d37a6e01 ("util: Fix changes to assert_with_msg()"), and rampstream_in fails there too (though not every time).
TCP/IPv4: sequence check, ramps, inbound ...failed.
and rampstream_out sometimes hangs too.
I'm going to try with earlier commits.
For me the problem can happen with any commit...
As it depends on the execution path and on the load and speed of the system, it looks like a race condition.
Hah, thanks for checking. Maybe...
Did you try testing on a host with a kernel patched with "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive"?
Now I tried, and yes, the test doesn't hang anymore! I seem to have an issue with teardown functions on recent kernels (current net.git HEAD more or less):
---
+ teardown_migrate
+ cat /tmp/passt-tests-VVtLn0/migrate/qemu_1.pid
+ /home/sbrivio/passt/test/nstool exec /tmp/passt-tests-VVtLn0/migrate/ns1.hold -- kill 16
qemu-system-x86_64: terminating on signal 15 from pid 34 ()
+ cat /tmp/passt-tests-VVtLn0/migrate/qemu_2.pid
+ /home/sbrivio/passt/test/nstool exec /tmp/passt-tests-VVtLn0/migrate/ns1.hold -- kill 15
18.8974: ================ Vhost user message ================
18.8974: Request: VHOST_USER_GET_VRING_BASE (11)
18.8974: Flags: 0x1
18.8974: Size: 8
18.8974: State.index: 0
18.8975: ================ Vhost user message ================
18.8975: Request: VHOST_USER_GET_VRING_BASE (11)
18.8975: Flags: 0x1
18.8975: Size: 8
18.8975: State.index: 1
qemu-system-x86_64: terminating on signal 15 from pid 35 ()
18.7961: Client connection closed
18.7962: Closing TCP_REPAIR helper socket
+ context_wait qemu_1
+ __name=qemu_1
+ __pidfile=/tmp/passt-tests-VVtLn0/migrate/context_qemu_1.pid
+ cat /tmp/passt-tests-VVtLn0/migrate/context_qemu_1.pid
+ rc=0
+ rm /tmp/passt-tests-VVtLn0/migrate/context_passt_repair_2.stdout.9pwpVbQr /tmp/passt-tests-VVtLn0/migrate/context_passt_repair_2.stderr.dSY5hBu1
+ __pid=67766
+ rm /tmp/passt-tests-VVtLn0/migrate/context_qemu_1.pid
+ [ 1 -eq 1 ]
+ echo [Exit code: 0]
+ echo -n passt_repair_2$
+ return 0
18.9016: Client connection closed
18.9018: Closing TCP_REPAIR helper socket
+ wait 67766
+ rc=0
+ rm /tmp/passt-tests-VVtLn0/migrate/context_passt_repair_1.stdout.JEyDGxXe /tmp/passt-tests-VVtLn0/migrate/context_passt_repair_1.stderr.WU550iEI
+ [ 1 -eq 1 ]
+ echo [Exit code: 0]
+ echo -n passt_repair_1$
+ return 0
+ rc=0
+ rm /tmp/passt-tests-VVtLn0/migrate/context_qemu_2.stdout.Dm8EAhfl /tmp/passt-tests-VVtLn0/migrate/context_qemu_2.stderr.207qJYPA
+ [ 1 -eq 1 ]
+ echo [Exit code: 0]
+ echo -n qemu_2$
+ return 0
2026/05/22 04:08:23 socat[73089] E connect(5, AF=40 cid:94558 port:22, 16): Connection timed out
Connection closed by UNKNOWN port 65535
...
---
it looks like we stop QEMU a bit too early. But it should be unrelated.
I'm now trying to find some kind of workaround for existing (not fixed) kernel versions. Maybe stopping rampstream_in for a moment or something like that.
For some weird reason even very blatant throttling (100 ms - 1 s delays every 10000 ramps, or an explicit 500 ms pause via signal before migration) doesn't help. So it doesn't seem to be *that* kind of race. I should probably check the exact same kernel version with and without the fix...

-- 
Stefano