On Wed, 21 Dec 2022 17:00:24 +1100 David Gibson wrote:
On Tue, Dec 20, 2022 at 11:42:46AM +0100, Stefano Brivio wrote:
Sorry for the further delay,
On Wed, 14 Dec 2022 11:35:46 +0100 Stefano Brivio wrote:
On Wed, 14 Dec 2022 12:42:14 +1100 David Gibson wrote:
On Tue, Dec 13, 2022 at 11:48:47PM +0100, Stefano Brivio wrote:
Sorry for the long delay here,
On Mon, 5 Dec 2022 19:14:21 +1100 David Gibson wrote:
Usually udp_sock_handler() will receive multiple (up to 32) datagrams at once, then forward them all to the tap interface. For unclear reasons, though, when in pasta mode we will only receive and forward a single datagram at a time. Change it to receive multiple datagrams at once, like the other paths.
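A minimal sketch of the batched receive described here, assuming a pre-initialised struct mmsghdr array like the one udp_sock_handler() uses and UDP_MAX_FRAMES == 32 -- illustration only, not the actual passt code:

#define _GNU_SOURCE
#include <sys/socket.h>

#define UDP_MAX_FRAMES	32	/* max datagrams pulled in one pass */

/* Receive up to UDP_MAX_FRAMES datagrams in one syscall; each filled
 * mmh[i].msg_len then gives the size of one datagram to forward to the
 * tap interface.
 */
static int udp_batch_recv(int s, struct mmsghdr *mmh)
{
	return recvmmsg(s, mmh, UDP_MAX_FRAMES, 0, NULL);
}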
This is explained in the commit message of 6c931118643c ("tcp, udp: Receive batching doesn't pay off when writing single frames to tap").
I think it's worth re-checking the throughput now as this path is a bit different, but unfortunately I didn't include this in the "perf" tests :( because at the time I introduced those I wasn't sure it even made sense to have traffic from the same host being directed to the tap device.
The iperf3 runs where I observed this are actually the ones from the Podman demo. Ideally that case should also be checked in the perf/pasta_udp tests.
Hm, ok.
How fundamental is this for the rest of the series? I couldn't find any actual dependency on this but I might be missing something.
So the issue is that prior to this change in pasta we receive multiple frames at once on the splice path, but one frame at a time on the tap path. By the end of this series we can't do that any more, because we don't know before the recvmmsg() which one we'll be doing.
Oh, right, I see. Then let me add this path to the perf/pasta_udp test and check how relevant this is now; I'll get back to you in a bit.
I was checking the wrong path. With this:
diff --git a/test/perf/pasta_udp b/test/perf/pasta_udp
index 27ea724..973c2f4 100644
--- a/test/perf/pasta_udp
+++ b/test/perf/pasta_udp
@@ -31,6 +31,14 @@ report pasta lo_udp 1 __FREQ__
 
 th	MTU 1500B 4000B 16384B 65535B
 
+tr	UDP throughput over IPv6: host to ns
+nsout	IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+nsout	ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local'
+bw	-
+bw	-
+bw	-
+iperf3	BW host ns __ADDR6__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 15G
+bw	__BW__ 7.0 9.0
 tr	UDP throughput over IPv6: ns to host
 ns	ip link set dev lo mtu 1500
diff --git a/test/run b/test/run
index e07513f..b53182b 100755
--- a/test/run
+++ b/test/run
@@ -67,6 +67,14 @@ run() {
 	test build/clang_tidy
 	teardown build
 
+	VALGRIND=0
+	setup passt_in_ns
+	test passt/ndp
+	test passt/dhcp
+	test perf/pasta_udp
+	test passt_in_ns/shutdown
+	teardown passt_in_ns
+
 	setup pasta
 	test pasta/ndp
 	test pasta/dhcp
Ah, ok. Can we add that to the standard set of tests ASAP, please.
Yes -- that part itself was easy, but now I'm fighting against my own finest write-only code that generates the JavaScript snippet for the performance report (perf_fill_lines() in test/lib/perf_report -- and this is not a suggestion to have a look at it ;)). I'm trying to rework it a bit together with the "new" test.
I get 21.6 gbps after this series, and 29.7 gbps before -- it's quite significant.
Drat.
And there's nothing strange in perf's output, really: the distribution of overhead per function is pretty much the same, but writing multiple messages to the tap device just takes more cycles per message compared to a single message.
That's so weird. It should be basically an identical set of write()s, except that they happen in a batch, rather than a bit spread out. I guess it has to be some kind of cache locality thing. I wonder if the difference would go away or reverse if we had a way to submit multiple frames with a single syscall.
I haven't tried, but to test this, I think we could actually just write multiple frames in a single call, with subsequent headers and everything, and the iperf3 server will simply report how many bytes it received.
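For instance, a rough sketch of that test hack -- tap_fd, frame and flen are made-up names, frame preparation (headers included) is left out, and the buffer layout is only an assumption:

#include <string.h>
#include <unistd.h>

/* Stack n already-prepared frames back to back in one buffer and push
 * them to the tap device with a single write(), instead of one write()
 * per frame. iperf3 on the receiving side just counts bytes, so this is
 * enough to compare one syscall vs. many.
 */
static ssize_t tap_write_batch(int tap_fd, const char *const frame[],
			       const size_t flen[], int n)
{
	char buf[65535];
	size_t off = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (off + flen[i] > sizeof(buf))
			break;
		memcpy(buf + off, frame[i], flen[i]);
		off += flen[i];
	}

	return write(tap_fd, buf, off);
}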
I'm a bit ashamed to propose this, but do you think about something like:
	if (c->mode == MODE_PASTA) {
		if (recvmmsg(ref.r.s, mmh_recv, 1, 0, NULL) <= 0)
			return;

		if (udp_mmh_splice_port(v6, mmh_recv)) {
			n = recvmmsg(ref.r.s, mmh_recv + 1,
				     UDP_MAX_FRAMES - 1, 0, NULL);
		}

		if (n > 0)
			n++;
		else
			n = 1;
	} else {
		n = recvmmsg(ref.r.s, mmh_recv, UDP_MAX_FRAMES, 0, NULL);
		if (n <= 0)
			return;
	}
? Other than the inherent ugliness, it looks like a good approximation to me.
Hmm. Well, the first question is how much impact does going 1 message at a time have on the spliced throughput. If it's not too bad, then we could just always go one at a time for pasta, regardless of splicing. And we could even abstract that difference into the tap backend with a callback like tap_batch_size(c).
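Something like this, maybe -- just a sketch of the idea, assuming the existing struct ctx, MODE_PASTA and UDP_MAX_FRAMES definitions; tap_batch_size() itself is hypothetical, not an existing function:

/* Batch size the tap backend can usefully take per receive pass. */
static unsigned int tap_batch_size(const struct ctx *c)
{
	/* pasta writes one frame per syscall to the tap device, where
	 * receive batching didn't pay off; passt can batch up to
	 * UDP_MAX_FRAMES frames.
	 */
	if (c->mode == MODE_PASTA)
		return 1;

	return UDP_MAX_FRAMES;
}

udp_sock_handler() could then just pass tap_batch_size(c) as the vlen argument to recvmmsg(), without caring whether the datagrams end up spliced or written to tap.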
Right... it used to be significantly worse in the "spliced" case -- I checked that when I did the commit to use 1 instead of UDP_MAX_FRAMES in the other case -- but I don't have data. I'll test this again. -- Stefano