Hi,

As reported by Stefano, there is an asymmetry in throughput between host and guest with vhost-user. I tested the following kernel patch from Jason to see if it improves the performance:

------------------------------------------------------------------------------
commit e13b6da7045f997e1a5a5efd61d40e63c4fc20e8
Author: Jason Wang <jasowang(a)redhat.com>
Date:   Tue Feb 18 10:39:08 2025 +0800

    virtio-net: tweak for better TX performance in NAPI mode

    There are several issues in start_xmit():

    - Transmitted packets need to be freed before sending a packet; this
      introduces delay and increases the average packet transmit time. It
      also increases the time spent holding the TX lock.

    - Notification is enabled after free_old_xmit_skbs(), which will
      introduce unnecessary interrupts if the TX notification happens on
      the same CPU that is doing the transmission (actually, the
      virtio-net driver is optimized for this case).

    So this patch tries to avoid those issues by not cleaning transmitted
    packets in start_xmit() when TX NAPI is enabled, and by disabling
    notifications even more aggressively. Notifications will be disabled
    from the beginning of start_xmit(). But we can't enable delayed
    notification after TX is stopped, as we would lose notifications.
    Instead, the delayed notification is enabled after the virtqueue is
    kicked, for best performance.

    Performance numbers:

    1) single queue, 2 vCPU guest with pktgen_sample03_burst_single_flow.sh
       (burst 256) + testpmd (rxonly) on the host:

       - When pinning the TX IRQ to the pktgen vCPU: split virtqueue PPS
         increased 55%, from 6.89 Mpps to 10.7 Mpps, and 32% of TX
         interrupts were eliminated. Packed virtqueue PPS increased 50%,
         from 7.09 Mpps to 10.7 Mpps, and 99% of TX interrupts were
         eliminated.

       - When pinning the TX IRQ to a vCPU other than pktgen: split
         virtqueue PPS increased 96%, from 5.29 Mpps to 10.4 Mpps, and 45%
         of TX interrupts were eliminated; packed virtqueue PPS increased
         78%, from 6.12 Mpps to 10.9 Mpps, and 99% of TX interrupts were
         eliminated.

    2) single queue, 1 vCPU guest + vhost-net/TAP on the host: a single
       netperf session from guest to host shows an 82% improvement, from
       31 Gb/s to 58 Gb/s; %stddev was reduced from 34.5% to 1.9% and 88%
       of TX interrupts were eliminated.

    Signed-off-by: Jason Wang <jasowang(a)redhat.com>
    Acked-by: Michael S. Tsirkin <mst(a)redhat.com>
    Signed-off-by: David S. Miller <davem(a)davemloft.net>
------------------------------------------------------------------------------

Test setup:

  systemctl stop firewalld.service || service iptables stop || iptables -F
  /sbin/sysctl -w net.core.rmem_max=536870912
  /sbin/sysctl -w net.core.wmem_max=536870912

____

I ran the tests with a 6.14-rc7 kernel:

From the guest:

  iperf3 -c 10.6.68.254 -P2 -Z -t5 -l 1M -w 16M

  [SUM]   0.00-5.00  sec  14.5 GBytes  24.9 Gbits/sec    0   sender
  [SUM]   0.00-5.00  sec  14.5 GBytes  24.9 Gbits/sec        receiver

From the host:

  iperf3 -c localhost -P2 -Z -t5 -p 10001 -l 1M -w 16M

  [SUM]   0.00-5.00  sec  28.9 GBytes  49.6 Gbits/sec    0   sender
  [SUM]   0.00-5.03  sec  28.8 GBytes  49.2 Gbits/sec        receiver

____

The results with 6.14-rc7 + e13b6da7045f:

From the guest:

  iperf3 -c 10.6.68.254 -P2 -Z -t5 -l 1M -w 16M

  [SUM]   0.00-5.00  sec  14.8 GBytes  25.4 Gbits/sec    0   sender
  [SUM]   0.00-5.01  sec  14.8 GBytes  25.4 Gbits/sec        receiver

From the host:

  iperf3 -c localhost -P2 -Z -t5 -p 10001 -l 1M -w 16M

  [SUM]   0.00-5.00  sec  28.5 GBytes  48.9 Gbits/sec    0   sender
  [SUM]   0.00-5.03  sec  28.4 GBytes  48.6 Gbits/sec        receiver

So we only get about a 2% improvement in the guest-to-host direction (24.9 -> 25.4 Gbits/sec), and the asymmetry with the host-to-guest direction remains.

Thanks,
Laurent
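
P.S. For anyone reproducing this, here is a minimal sketch of the server side I assume these runs used. Only the address 10.6.68.254 and the port 10001 come from the client commands above; the placement of the listeners and the forwarding of host port 10001 to the guest are assumptions about the setup:

  # On the host: listener for the guest -> host run (iperf3 default port 5201);
  # assumed, not shown in the results above
  iperf3 -s -D

  # In the guest: listener for the host -> guest run, assumed to be reached
  # from the host as localhost:10001 through a forwarded port
  iperf3 -s -p 10001 -D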