Re: [PATCH v4 00/10] vhost-user: Preparatory series for multiple iovec entries per virtqueue element

26 May 2026


On 5/26/26 10:38, Stefano Brivio wrote:
...
On Tue, 26 May 2026 09:59:55 +0200
Stefano Brivio  wrote:
...
On Tue, 26 May 2026 09:31:51 +0200
Laurent Vivier  wrote:
...
On 5/22/26 14:04, Stefano Brivio wrote:
...
On Fri, 22 May 2026 07:44:55 +0200
Stefano Brivio  wrote:
...
On Fri, 22 May 2026 06:22:39 +0200
Stefano Brivio  wrote:
...
On Fri, 22 May 2026 01:13:33 +0200
Laurent Vivier  wrote:
> On 5/21/26 10:30, Laurent Vivier wrote:
>> On 5/20/26 22:53, Stefano Brivio wrote:
>>> On Wed, 20 May 2026 18:18:52 +0200
>>> Stefano Brivio  wrote:
>>>
>>>> On Wed, 20 May 2026 18:07:08 +0200
>>>> Stefano Brivio  wrote:
>>>>
>>>>> On Wed, 20 May 2026 17:34:45 +0200
>>>>> Stefano Brivio  wrote:
>>>>>> On Wed, 13 May 2026 13:52:08 +0200
>>>>>> Laurent Vivier  wrote:
>>>>>>> Currently, the vhost-user path assumes each virtqueue element contains
>>>>>>> exactly one iovec entry covering the entire frame.  This assumption
>>>>>>> breaks as some virtio-net drivers (notably iPXE) provide descriptors where the
>>>>>>> vnet header and the frame payload are in separate buffers, resulting in
>>>>>>> two iovec entries per virtqueue element.
>>>>>>>
>>>>>>> This series refactors the vhost-user data path so that frame lengths,
>>>>>>> header sizes, and padding are tracked and passed explicitly rather than
>>>>>>> being derived from iovec sizes.  This decoupling is a prerequisite for
>>>>>>> correctly handling padding of multi-buffer frames.
>>>>>>
>>>>>> Sorry to bring (likely) bad news, but this series seems to introduce a
>>>>>> regression: I saw the migration/rampstream_in test fail twice in a
>>>>>> row, which I had never seen happen (I think I saw a single failure a
>>>>>> long time ago when the machine had a high CPU load, but nothing else).
>>>>>>
>>>>>> I'm currently bisecting and the bisect seems to point towards the end
>>>>>> of the series (probably 10/10), but I haven't finished yet. I'll keep
>>>>>> you posted. I haven't spotted anything that might cause issues there.
>>>>>
>>>>> Yeah, that's the one :(
>>>>>
>>>>> $ git bisect bad
>>>>> db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
>>>>> commit db798fc60f4c5869cb53168354e068fb4dabd91a
>>>>> Author: Laurent Vivier 
>>>>> Date:   Wed May 13 13:52:18 2026 +0200
>>>>>
>>>>>        vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()
>>
>> I checked on my system with the commit previous to this series,
>> bcc3d37a6e01 ("util: Fix changes to assert_with_msg()") and rampstream_in fails too (not
>> every time).
>>
>>    > TCP/IPv4: sequence check, ramps, inbound
>> ...failed.
>>
>> and rampstream_out hangs sometimes too.
>>
>> I'm going to try with earlier commits.
>
> For me the problem can happen with any commit...
>
> As it depends on the execution path and on the load and speed of the system it looks like
> a race condition.
Hah, thanks for checking. Maybe...
> Did you try to test on a host with a kernel patched with
> "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?
Now I tried, and yes, the test doesn't hang anymore! I seem to have an
issue with teardown functions on recent kernels (current net.git HEAD
more or less):
---
[...]
2026/05/22 04:08:23 socat[73089] E connect(5, AF=40 cid:94558 port:22, 16): Connection timed out
Connection closed by UNKNOWN port 65535
...
---
it looks like we stop QEMU a bit too early. But it should be unrelated.
Oops, I forgot to upgrade QEMU on the virtual machine I was using to
test those kernel builds, I had a somewhat outdated 8.1 version and it
failed migration for unrelated reasons. It works with 11.0.
Back to kernel versions: the "problem" is that with a recent
net-next.git HEAD, with or without my fix, in a nested VM, the test
always passes (20/20). And I can't easily test things non-nested.
I guess I could just skip that test, for the moment, from the set I run on
git push, and run it manually in the virtual machine.
But judging from captures (test_logs/pasta_1.pcap from PCAP=1 ./run)
I'm fairly sure it's not *that* issue:
    465  12.141763    192.0.2.1 → 88.198.0.164 58451 TCP [TCP Window Full] 34416 → 10001 [PSH, ACK] Seq=10002100 Ack=1 Win=65536 Len=58397
    466  12.187195 88.198.0.164 → 192.0.2.1    54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
    467  13.187281    192.0.2.1 → 88.198.0.164 4150 TCP 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
last data transfer from client (rampstream):
    468  13.187358 88.198.0.164 → 192.0.2.1    54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
everything acknowledged, migration starts now:
    469  14.143217 fe80::f471:c3ff:fe10:4e45 → ff02::2      70 ICMPv6 Router Solicitation from f6:71:c3:10:4e:45
    470  14.687123 88.198.0.164 → 192.0.2.1    54 TCP [TCP ZeroWindow] [TCP Keep-Alive] 10001 → 34416 [ACK] Seq=0 Ack=10060497 Win=0 Len=0
migration completed: and we acknowledge the right sequence (10060497),
so it didn't jump forward.
But starting from this point:
    471  14.687265    192.0.2.1 → 88.198.0.164 60 TCP 34416 → 10001 [ACK] Seq=10060497 Ack=1 Win=65536 Len=0
    472  16.687412    192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
    473  16.687450 88.198.0.164 → 192.0.2.1    54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
    474  20.687650    192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
    475  20.687692 88.198.0.164 → 192.0.2.1    54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
    476  28.687817    192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
we keep advertising a zero window (that's the kernel doing it really),
as if we were unable to dequeue data.
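For reference, the gaps between frames 471, 472, 474 and 476 above are
about 2 s, 4 s and 8 s, which looks like the sender's timer backing off
exponentially while the window stays at zero. A minimal C sketch of that
check follows; gaps_double() and the rounding of the timestamps to
milliseconds are illustrative assumptions, nothing here comes from passt
or from the kernel:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helper (hypothetical name): return 1 if every gap between
 * consecutive timestamps, in milliseconds, is twice the previous gap,
 * within tol_ms. Return 0 otherwise, or if there are fewer than two gaps.
 */
int gaps_double(const long *t_ms, size_t n, long tol_ms)
{
	size_t i;

	assert(t_ms || !n);

	if (n < 3)
		return 0;	/* need at least two gaps to compare */

	for (i = 2; i < n; i++) {
		long prev = t_ms[i - 1] - t_ms[i - 2];
		long cur = t_ms[i] - t_ms[i - 1];
		long diff = cur - 2 * prev;

		if (prev <= 0 || diff > tol_ms || diff < -tol_ms)
			return 0;
	}

	return 1;
}
```

With the rounded timestamps 14687, 16687, 20687 and 28687 ms and a
tolerance of a few milliseconds, the function returns 1.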
I enabled --trace just for the target instance of passt, and I don't
see anything suspicious there:
13.0958: Receiving 1 flows
13.0958: Flow 0 (NEW): FREE -> NEW
13.0958: Flow 0 (TCP connection): TGT -> TYPED
13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001
13.0958: Flow 0 (TCP connection): Side 1 hash table insert: bucket: 138154
13.0958: Flow 0 (TCP connection): TYPED -> ACTIVE
13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001
13.0959: Flow 0 (TCP connection): Extended migration data, socket 83 sequences send 3121929544 receive 1643895001
13.0959: Flow 0 (TCP connection):   pending queues: send 0 not sent 0 receive 3500081
13.0959: Flow 0 (TCP connection):   window: snd_wl1 1647395082 snd_wnd 65536 max 65536 rcv_wnd 0 rcv_wup 1647395082
13.0959: Flow 0 (TCP connection):   SO_PEEK_OFF disabled  offset=0
13.0985: Got packet, but RX virtqueue not usable yet
13.0985: Closing migration channel, fd: 82
13.0985: Closing TCP_REPAIR helper socket
13.0985: passt: epoll event on vhost-user command socket 77 (events: 0x00000001)
then the usual VHOST_USER_CHECK_DEVICE_STATE and VHOST_USER_SET_VRING_ENABLE
commands. After that, a tight loop of:
13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.0986: Got packet, but RX virtqueue not usable yet
13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.0986: Got packet, but RX virtqueue not usable yet
until we go further with the vhost-user setup. I still see this message
which I had never noticed (but I didn't try to bisect around it):
13.1006: ================ Vhost user message ================
13.1006: Request: VHOST_USER_SET_VRING_ADDR (9)
[...]
13.1006: Last avail index != used index: 3252 != 3027
and then, after VHOST_USER_SET_VRING_CALL and:
13.1008: passt: epoll event on vhost-user kick socket 78 (events: 0x00000001)
13.1008: vhost-user: got kick_data: 0000000000000001 idx: 1
it's just a tight loop of:
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
as if we weren't dequeueing anything from there.
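On the "Last avail index != used index: 3252 != 3027" message: assuming
these are the device-side last available index and the used index of the
same ring, their difference would be the number of buffers taken from the
ring but not yet returned, 225 here. Virtio ring indices are free-running
16-bit counters, so the subtraction has to wrap. A small C sketch, where
vring_outstanding() and its ring_size parameter are hypothetical names
for illustration, not passt functions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of ring entries fetched by the device but
 * not yet marked as used. Both indices are free-running 16-bit counters,
 * so the difference is taken modulo 65536.
 */
uint16_t vring_outstanding(uint16_t last_avail_idx, uint16_t used_idx,
			   uint16_t ring_size)
{
	uint16_t n = (uint16_t)(last_avail_idx - used_idx);

	/* More outstanding entries than the ring holds means the indices
	 * are inconsistent.
	 */
	assert(n <= ring_size);

	return n;
}
```

vring_outstanding(3252, 3027, 256) gives 225. The queue size of 256 is
only an example value: the log doesn't show the actual size.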
I'm starting to suspect we might be hitting two different issues: perhaps
things fail on your setup because of the kernel bug with TCP_REPAIR not
freezing the queue, and they fail on my setup for some other reason.
For me it's very deterministic though: with patch 10/10 things always
fail, and without it they never fail.
I guess I'll add more prints and check for more messages before/after
that patch.
In fact, there is a buffer leak: iov_skip_bytes() doesn't correctly compute the
number of used elements, so we don't release all the unused buffers.
I'm trying to fix that.
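To illustrate the kind of accounting involved, here is a minimal C sketch
that counts how many iovec entries a frame of a given length actually
touches. iov_entries_used() is a hypothetical helper written for this
example: it is not passt's iov_skip_bytes() and doesn't reproduce its
signature or the actual bug.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper: count the leading entries of iov[] needed to hold
 * the first len bytes. A partly filled last entry counts as used.
 * Entries from the returned index onwards hold no data.
 */
size_t iov_entries_used(const struct iovec *iov, size_t cnt, size_t len)
{
	size_t i;

	assert(iov || !cnt);

	for (i = 0; i < cnt && len > 0; i++) {
		if (iov[i].iov_len >= len)
			return i + 1;	/* data ends inside this entry */

		len -= iov[i].iov_len;
	}

	return i;	/* len was 0, or the data needs all cnt entries */
}
```

With entries of 10, 20 and 30 bytes, an 11-byte frame uses two entries
and the third one is unused. If a count like this comes out one too high
or one too low, a buffer is either returned while still in use or never
returned at all.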
Please try with series "[PATCH v7 0/4] vhost-user,tcp: Handle multiple iovec entries per
virtqueue element" applied, it reworks this part.
I'm trying it now. If that totally reworks this part and it fixes
things and it's ready to be merged (sorry, I didn't manage to have a
look yet) I don't think it's strictly necessary to figure out the
leak.
All tests pass with it, rampstream_in passed 20/20 times. Should I go
ahead and merge both series (UDP and TCP, they both look ready) or do
you still need to figure out the buffer leak first for other reasons?
No, you can go ahead.

Thanks,
Laurent