When reading received messages with MSG_PEEK, we sometimes have to read
the leading bytes of the stream several times, only to reach the bytes we
really want. This is clearly non-optimal.

What we would want is something similar to pread/preadv(), but working
even for TCP sockets. At the same time, we don't want to add any new
arguments to the recv/recvmsg() calls.

In this commit, we allow the user to set iovec.iov_base in the first
vector entry to NULL. This tells the socket to skip the first entry,
hence letting the iov_len field of that entry indicate the offset value.
This way, there is no need to add any new arguments or flags.

In the iperf3 log examples shown below, we can observe a throughput
improvement of ~20 % in the direction host->namespace when using the
protocol splicer 'passt'. This is a consistent result.

$ ./passt/passt/pasta --config-net -f

MSG_PEEK with offset not supported.

[root@fedora37 ~]# perf record iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 60344
[ 6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 60360
[ ID] Interval Transfer Bitrate
[...]
[ 6] 13.00-14.00 sec 2.54 GBytes 21.8 Gbits/sec
[ 6] 14.00-15.00 sec 2.52 GBytes 21.7 Gbits/sec
[ 6] 15.00-16.00 sec 2.50 GBytes 21.5 Gbits/sec
[ 6] 16.00-17.00 sec 2.49 GBytes 21.4 Gbits/sec
[ 6] 17.00-18.00 sec 2.51 GBytes 21.6 Gbits/sec
[ 6] 18.00-19.00 sec 2.48 GBytes 21.3 Gbits/sec
[ 6] 19.00-20.00 sec 2.49 GBytes 21.4 Gbits/sec
[ 6] 20.00-20.04 sec 87.4 MBytes 19.2 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 6] 0.00-20.04 sec 48.9 GBytes 21.0 Gbits/sec receiver
-----------------------------------------------------------

[jmaloy@fedora37 ~]$ ./passt/passt/pasta --config-net -f

MSG_PEEK with offset supported.

[root@fedora37 ~]# perf record iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 46362
[ 6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 46374
[ ID] Interval Transfer Bitrate
[...]
[ 6] 12.00-13.00 sec 3.18 GBytes 27.3 Gbits/sec
[ 6] 13.00-14.00 sec 3.17 GBytes 27.3 Gbits/sec
[ 6] 14.00-15.00 sec 3.13 GBytes 26.9 Gbits/sec
[ 6] 15.00-16.00 sec 3.17 GBytes 27.3 Gbits/sec
[ 6] 16.00-17.00 sec 3.17 GBytes 27.2 Gbits/sec
[ 6] 17.00-18.00 sec 3.14 GBytes 27.0 Gbits/sec
[ 6] 18.00-19.00 sec 3.17 GBytes 27.2 Gbits/sec
[ 6] 19.00-20.00 sec 3.12 GBytes 26.8 Gbits/sec
[ 6] 20.00-20.04 sec 119 MBytes 25.5 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 6] 0.00-20.04 sec 59.4 GBytes 25.4 Gbits/sec receiver
-----------------------------------------------------------

Passt is used to support VMs in containers, such as KubeVirt, and is
also generally supported in libvirt/QEMU since release 9.2 / 7.2.

Signed-off-by: Jon Maloy <jmaloy(a)redhat.com>
---
 net/ipv4/tcp.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53bcc17c91e4..e9d3b5bf2f66 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2310,6 +2310,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			      int *cmsg_flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	size_t peek_offset;
 	int copied = 0;
 	u32 peek_seq;
 	u32 *seq;
@@ -2353,6 +2354,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	if (flags & MSG_PEEK) {
 		peek_seq = tp->copied_seq;
 		seq = &peek_seq;
+		if (!msg->msg_iter.__iov[0].iov_base) {
+			peek_offset = msg->msg_iter.__iov[0].iov_len;
+			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
+			if (msg->msg_iter.nr_segs <= 1)
+				goto out;
+			msg->msg_iter.nr_segs -= 1;
+			if (msg->msg_iter.count <= peek_offset)
+				goto out;
+			msg->msg_iter.count -= peek_offset;
+			if (len <= peek_offset)
+				goto out;
+			len -= peek_offset;
+			*seq += peek_offset;
+		}
 	}
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
-- 
2.39.0

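For readers who want to see the calling convention from userspace, here
is a minimal sketch in C. It is an illustration under stated
assumptions, not code from passt or from the patch: the helper names
peek_offset_iov() and peek_skip_discard() are hypothetical. The first
helper only builds the iovec layout the commit message describes (NULL
iov_base, offset in iov_len), which only a kernel carrying this patch
would interpret as an offset. The second shows the portable approach
available without the patch, where the leading bytes are copied into a
scratch buffer and ignored.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not part of the patch): fill a two-entry iovec
 * using the convention from the commit message. Entry 0 has a NULL base
 * and carries the offset in iov_len; entry 1 is the real destination.
 * Assumption: only a kernel with this patch treats entry 0 as an offset.
 */
static void peek_offset_iov(struct iovec iov[2], size_t offset,
			    void *buf, size_t len)
{
	iov[0].iov_base = NULL;
	iov[0].iov_len  = offset;
	iov[1].iov_base = buf;
	iov[1].iov_len  = len;
}

/*
 * Fallback that works without the patch: entry 0 points at a scratch
 * buffer, so the leading 'offset' bytes are copied and then ignored.
 * Returns the number of bytes peeked past the offset, or -1 on error.
 */
static ssize_t peek_skip_discard(int fd, void *scratch, size_t offset,
				 void *buf, size_t len)
{
	struct iovec iov[2] = {
		{ .iov_base = scratch, .iov_len = offset },
		{ .iov_base = buf,     .iov_len = len },
	};
	struct msghdr mh;
	ssize_t n;

	memset(&mh, 0, sizeof(mh));
	mh.msg_iov = iov;
	mh.msg_iovlen = 2;

	n = recvmsg(fd, &mh, MSG_PEEK);	/* data stays queued */
	if (n < 0)
		return -1;

	return (size_t)n > offset ? n - (ssize_t)offset : 0;
}
```

The fallback copies the skipped bytes on every call, which is the
repeated work the commit message wants to avoid.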
Note that I only sent this one to passt-dev, not netdev. I would
appreciate feedback and possible ack/reviewed-by as soon as possible so I
can send it to netdev.

///jon

On 2023-12-05 18:20, Jon Maloy wrote:
> [...]

On Tue, 5 Dec 2023 18:20:28 -0500
Jon Maloy <jmaloy(a)redhat.com> wrote:

> When reading received messages with MSG_PEEK, we sometimes have to read
> the leading bytes of the stream several times, only to reach the bytes we
> really want. This is clearly non-optimal.

I'm not sure there are many other usage patterns like this outside
passt -- I would simply state that if we want to peek with an offset, we
can't. And perhaps explain why passt(1) and pasta(1) need to do that.

> What we would want is something similar to pread/preadv(), but working
> even for TCP sockets. At the same time, we don't want to add any new
> arguments to the recv/recvmsg() calls.
>
> In this commit, we allow the user to set iovec.iov_base in the first
> vector entry to NULL. This tells the socket to skip the first entry,
> hence letting the iov_len field of that entry indicate the offset value.
> This way, there is no need to add any new arguments or flags.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of ~20 % in the direction host->namespace when using the
> protocol splicer 'passt'. This is a consistent result.

I'm not sure how widely known it is, I would add a link
(https://passt.top).

> $ ./passt/passt/pasta --config-net -f
>
> MSG_PEEK with offset not supported.
>
> [root@fedora37 ~]# perf record iperf3 -s

Here you're profiling iperf3 (not pasta), but not showing the results of
the profiling. Indeed, if you have a consistent throughput improvement,
that's also great (and great to show), but there's no need to profile
iperf3 -- I don't expect any difference from its point of view.

> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 60344
> [ 6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 60360
> [ ID] Interval Transfer Bitrate
> [...]
> [ 6] 13.00-14.00 sec 2.54 GBytes 21.8 Gbits/sec
> [ 6] 14.00-15.00 sec 2.52 GBytes 21.7 Gbits/sec
> [ 6] 15.00-16.00 sec 2.50 GBytes 21.5 Gbits/sec
> [ 6] 16.00-17.00 sec 2.49 GBytes 21.4 Gbits/sec
> [ 6] 17.00-18.00 sec 2.51 GBytes 21.6 Gbits/sec
> [ 6] 18.00-19.00 sec 2.48 GBytes 21.3 Gbits/sec
> [ 6] 19.00-20.00 sec 2.49 GBytes 21.4 Gbits/sec
> [ 6] 20.00-20.04 sec 87.4 MBytes 19.2 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 6] 0.00-20.04 sec 48.9 GBytes 21.0 Gbits/sec receiver
> -----------------------------------------------------------
>
> [jmaloy@fedora37 ~]$ ./passt/passt/pasta --config-net -f
>
> MSG_PEEK with offset supported.
>
> [root@fedora37 ~]# perf record iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 46362
> [ 6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 46374
> [ ID] Interval Transfer Bitrate
> [...]
> [ 6] 12.00-13.00 sec 3.18 GBytes 27.3 Gbits/sec
> [ 6] 13.00-14.00 sec 3.17 GBytes 27.3 Gbits/sec
> [ 6] 14.00-15.00 sec 3.13 GBytes 26.9 Gbits/sec
> [ 6] 15.00-16.00 sec 3.17 GBytes 27.3 Gbits/sec
> [ 6] 16.00-17.00 sec 3.17 GBytes 27.2 Gbits/sec
> [ 6] 17.00-18.00 sec 3.14 GBytes 27.0 Gbits/sec
> [ 6] 18.00-19.00 sec 3.17 GBytes 27.2 Gbits/sec
> [ 6] 19.00-20.00 sec 3.12 GBytes 26.8 Gbits/sec
> [ 6] 20.00-20.04 sec 119 MBytes 25.5 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 6] 0.00-20.04 sec 59.4 GBytes 25.4 Gbits/sec receiver
> -----------------------------------------------------------

...that is, what I personally find more conclusive is that the overhead
spent in ____sys_recvmsg(), or tcp_recvmsg_locked(), decreases
dramatically with this:

$ perf record -g ./passt -f -t 5201
[...]
$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data.old_peek | head -1
57.16% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data.new_peek | head -1
38.66% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg

those command lines are a bit convoluted. I guess running pasta or passt
with 'perf stat' and selecting 'cycles' event for the interesting symbols
might be more obvious.

> Passt is used to support VMs in containers, such as KubeVirt, and is
> also generally supported in libvirt/QEMU since release 9.2 / 7.2.

Not just VMs in containers... but yes, that was the original use case. I
find it a bit confusing that you're using pasta(1) in the example but
mentioning passt(1) (perhaps it's my fault ;)) -- maybe mention that
pasta(1) is used with containers (Podman?) instead?

> Signed-off-by: Jon Maloy <jmaloy(a)redhat.com>
> ---
>  net/ipv4/tcp.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 53bcc17c91e4..e9d3b5bf2f66 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2310,6 +2310,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  			      int *cmsg_flags)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> +	size_t peek_offset;

This could be moved where it's needed.

>  	int copied = 0;
>  	u32 peek_seq;
>  	u32 *seq;
> @@ -2353,6 +2354,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  	if (flags & MSG_PEEK) {
>  		peek_seq = tp->copied_seq;
>  		seq = &peek_seq;
> +		if (!msg->msg_iter.__iov[0].iov_base) {
> +			peek_offset = msg->msg_iter.__iov[0].iov_len;
> +			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
> +			if (msg->msg_iter.nr_segs <= 1)
> +				goto out;

'err' shouldn't be ENOTCONN here (that's why I got that cryptic error
when I messed up recvmsg() while reviewing the other patch). EINVAL
would make more sense.

I haven't checked the other cases.

> +			msg->msg_iter.nr_segs -= 1;
> +			if (msg->msg_iter.count <= peek_offset)
> +				goto out;

I find it a bit difficult to follow these checks interleaved with
assignments. That is, I've been wondering for a while why you would want
to check for msg_iter.count <= peek_offset only after decreasing the
number of segments, only to find out that there's actually no
relationship between the two things. Maybe newlines between different
parts of the overall logic would help.

> +			msg->msg_iter.count -= peek_offset;
> +			if (len <= peek_offset)
> +				goto out;
> +			len -= peek_offset;
> +			*seq += peek_offset;
> +		}
>  	}
>
>  	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);

-- 
Stefano
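
The two review points about the error code and the interleaved checks
can be illustrated with a small userspace model. This is only a sketch
under assumptions, not kernel code and not a proposed replacement for
the hunk: struct peek_args and apply_peek_offset() are hypothetical
names that stand in for msg_iter.nr_segs, msg_iter.count and len, and
returning -EINVAL for all three checks is an assumption (the review only
asks for EINVAL in the nr_segs case).

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the three values the hunk adjusts. */
struct peek_args {
	size_t nr_segs;	/* models msg->msg_iter.nr_segs */
	size_t count;	/* models msg->msg_iter.count */
	size_t len;	/* models the 'len' argument */
};

/*
 * Validate everything first, then apply the adjustments, so a failed
 * check never leaves the arguments half-modified.
 * Returns 0 on success, -EINVAL (assumed error code) otherwise.
 */
static int apply_peek_offset(struct peek_args *a, size_t offset)
{
	/* Checks: a real buffer must follow, and data must remain. */
	if (a->nr_segs <= 1)
		return -EINVAL;
	if (a->count <= offset || a->len <= offset)
		return -EINVAL;

	/* Adjustments: drop the offset entry, shrink the byte counts. */
	a->nr_segs -= 1;
	a->count -= offset;
	a->len -= offset;

	return 0;
}
```

Grouping the checks ahead of the assignments makes it visible that the
segment count and the byte counts are unrelated conditions.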